## Week 1 exercise

You are given a dataset of materials that contains both inter-metallic and ionic materials, together with a series of features (descriptors) that describe each material. Your task this week is to explore the relationships between the descriptors: look for any suspicious data, decide whether any descriptors are very highly correlated, reduce the dimensionality, and look at some initial clustering of the data.

The first thing you need to do is upload the data file (`training-data-week-1.pickle`) to Colab. You will find this data online at: https://github.com/mdi-group/ai-for-chemistry-handson/blob/main/data/training-data-week-1.pickle

```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
```

## 1 Load and look at the data

* Use pandas to read `training-data-week-1.pickle`.
```
df = pd.read_pickle('training-data-week-1.pickle')
```
* Take a look at the top few entries of this data frame. See part 2 of *lecture-1-exploratory-data-analysis.ipynb*.
```
df.head()
```
* Explore some non-graphical summary statistics of the different columns using pandas. See part 2 of *lecture-1-exploratory-data-analysis.ipynb*.
```
df.describe()
```
* Get information about the types of data in each column using pandas. See part 2 of *lecture-1-exploratory-data-analysis.ipynb*.
```
df.info()
```

## 2 Graphical examination

Use seaborn to make scatter plots of the different variables against each other.

```
import seaborn as sns
sns.pairplot(df)
```

## 3 Inspect the individual distributions

Calculate the skew and the kurtosis for each column. Which column has the highest skew, and which has the highest kurtosis?
You can get a list of columns using:

```
column_names = list(df.columns)
print(column_names)
```


Use the code for `skew` and `kurt` from lecture 1 [in the ebook](https://keeeto.github.io/ebook-data-analysis/lecture-1-exploratory-data-analysis.html).
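As an alternative to the lecture code, pandas has built-in `skew()` and `kurt()` methods. The following is a minimal sketch on synthetic data; the frame `demo` and its column names are made up for illustration, and in the exercise you would use `df` instead. Note that pandas reports *excess* kurtosis (a normal distribution gives roughly 0), which may differ from the convention used in the lecture.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the exercise data (hypothetical column names).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(size=500),
    "right_skewed": rng.exponential(size=500),
})

# Skew and excess kurtosis for every numeric column, one row per column.
numeric = demo.select_dtypes(include="number")
shape_stats = pd.DataFrame({"skew": numeric.skew(), "kurtosis": numeric.kurt()})
print(shape_stats)

# Name of the column with the largest value of each statistic.
print("Highest skew:", shape_stats["skew"].idxmax())
print("Highest kurtosis:", shape_stats["kurtosis"].idxmax())
```

Collecting both statistics in one small table makes it easy to sort or compare columns side by side.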


## 4 Using boxplots

Inspect each column and look for outliers.
* Make box plots for each of the columns.
    * If you find a very serious outlier (say, more than 1000 away from the mean value), drop that row from the data. An outlier this extreme stretches the axis so much that the box itself seems to disappear.
* Save the new clean dataframe to `week1-cleaned-data.pickle`.
    

### Plot the box plots

Get the number of features: `print(len(list(df.columns)))`

In this case we have 9 features, so we can use a 3x3 grid of plots. Use this code:
```
column_names = list(df.columns)
fig, ax = plt.subplots(3, 3, figsize=(10, 10))
for i in range(3):
    for j in range(3):
        # Map the (row, column) grid position to the matching feature
        name = column_names[3*i + j]
        ax[i, j].boxplot(df[name].values)
        ax[i, j].set_title(name)
plt.tight_layout()
```

### Identify outliers

Are there any box plots with points more than 1000 away from the mean?
If there are, drop these rows from the data.

To drop the data, locate the offending rows and remove them. For example, to drop every row where a column's value is 1000 or more:
```
df.drop(df[df['column name'] >= 1000].index, inplace=True)
```
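If you prefer to filter on the distance from the mean rather than on the raw value, one possible sketch is shown below. The toy frame `demo`, the column name `feature`, and the variable `threshold` are illustrative assumptions, not part of the exercise data.

```python
import pandas as pd

# Toy data with one extreme value (hypothetical column name).
demo = pd.DataFrame(
    {"feature": [1.0, 2.0, 3.0, 2.5, 1.5, 2.2, 2.8, 1.9, 2.1, 5000.0]}
)

threshold = 1000  # maximum allowed distance from the mean

# Absolute distance of each value from the column mean.
distance = (demo["feature"] - demo["feature"].mean()).abs()

# Keep only the rows within the threshold; this returns a new frame
# and leaves `demo` untouched.
cleaned = demo[distance <= threshold]
print(len(demo), "rows before,", len(cleaned), "rows after")
```

Bear in mind that an extreme outlier also shifts the mean, so with very few rows this test can flag ordinary points too. Check the box plots again after filtering.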

When you have dropped the outliers, save the dataset:
```
df.to_pickle('week1-cleaned-data.pickle')
```

Do the 3x3 boxplots again and make sure outliers are gone.

## 5 Correlations

Obtain the Pearson correlations between the different columns and inspect them.
Which columns seem to be most closely related? Are there any possibly redundant columns that you might remove?

The code for plotting Pearson correlations in a heatmap is in lecture 1 [in the ebook](https://keeeto.github.io/ebook-data-analysis/lecture-1-exploratory-data-analysis.html), in the section *Explore correlations in the data*.
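Alongside the heatmap, it can help to list the column pairs ranked by the strength of their correlation. The sketch below does this on synthetic data; `demo` and the columns `a`, `b`, `c` are hypothetical, and you would replace `demo` with your cleaned `df`.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is almost a rescaled copy of "a", "c" is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = demo.corr(method="pearson")

# Keep only the upper triangle so each pair appears once and the
# diagonal (self-correlation of 1) is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first.
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

Sorting on the absolute value treats strong negative correlations as being just as redundant as strong positive ones.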

Drop the redundant (highly correlated) columns:
```
df.drop([<list of columns to drop>], inplace=True, axis=1)
```
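One way to build the list of columns to drop is to apply a threshold to the absolute correlations. The following is a sketch only: the cutoff of 0.95 is an arbitrary assumption, and `demo` with its columns is synthetic. For each highly correlated pair it keeps the column that comes first and drops the later one.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is almost a rescaled copy of "a", "c" is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

threshold = 0.95  # assumed cutoff; choose one that suits your data

# Absolute correlations, upper triangle only (each pair counted once).
corr = demo.corr(method="pearson").abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it correlates strongly with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = demo.drop(columns=to_drop)
print("Dropping:", to_drop)
print("Remaining:", list(reduced.columns))
```

Which column of a correlated pair to keep is a judgment call; you may prefer to keep the descriptor that is easier to interpret.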

Plot the heatmap again to confirm that the redundant columns are gone.