# Discover 

This step usually consists of the following tasks:

1. Obtain
2. Clean
3. Explore
4. Establish baseline outcomes
5. Hypothesize solutions

The steps that usually require some amount of Python coding are 1, 2 and 3, which is what we will review in the following sections.

## Imports

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
```

## Load the data

First, use pandas to import the files (static, in this case) that will be used for the analysis. We will probably have to import either one complete dataset or separate datasets for training and testing.

```python
# Placeholder paths: the training and test sets live in separate files
train_df = pd.read_csv(path_to_train_file)
test_df = pd.read_csv(path_to_test_file)
```
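When only one complete dataset is available, a test set can be held out with pandas alone. The following is a minimal sketch under stated assumptions: the in-memory CSV, the column names, and the 80/20 split are all illustrative and not part of the original workflow.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real file on disk.
csv_text = """feature1,feature2,feature_target
1,a,10.0
2,b,12.5
3,a,9.0
4,c,14.0
5,b,11.0
"""

full_df = pd.read_csv(io.StringIO(csv_text))

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
test_df = full_df.sample(frac=0.2, random_state=0)
train_df = full_df.drop(test_df.index)

print(train_df.shape, test_df.shape)
```

Dropping the sampled index from the full frame guarantees that no row ends up in both sets.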

We will also need to set apart our features and target variable:

```python
features = ['feature1', 'feature2', ...]
target = ['feature_target']

train_features = train_df[features]
train_target = train_df[target]
```

## Inspect the dataset 

Several checks can be carried out here, including but not limited to:

### Checking head and tail

```python
train_features.head(10)
train_features.tail(10)
```

### Use of `info()` to see types and details

```python
train_features.info()
```
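`info()` reports the non-null count per column, so the same information can be turned into explicit missing-value counts. This is a small sketch on a hypothetical toy frame; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries.
toy_df = pd.DataFrame({
    "feature1": [1.0, np.nan, 3.0, 4.0],
    "feature2": ["a", "b", None, None],
})

# Number of missing values per column.
missing_counts = toy_df.isna().sum()

# Fraction of missing values per column (mean of a boolean mask).
missing_share = toy_df.isna().mean()

print(missing_counts)
print(missing_share)
```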

### Check for duplicates

```python
train_features.duplicated().sum()
```

This will output the total number of duplicated records.
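Beyond counting, it often helps to look at the duplicated rows before deciding whether to drop them. A minimal sketch, assuming a hypothetical toy frame in which one row is repeated:

```python
import pandas as pd

# Hypothetical data: the second and third rows are identical.
toy_df = pd.DataFrame({
    "feature1": [1, 2, 2, 3],
    "feature2": ["a", "b", "b", "c"],
})

# Rows that repeat an earlier row (the first occurrence is not counted).
n_duplicates = toy_df.duplicated().sum()

# keep=False marks every copy, which is handy for inspection.
duplicate_rows = toy_df[toy_df.duplicated(keep=False)]

# Keep the first occurrence of each row and drop the rest.
deduplicated = toy_df.drop_duplicates()

print(n_duplicates)
print(duplicate_rows)
```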

### Identify numerical and categorical variables

```python
train_features.columns

numerical = ['feature1', 'feature2', ...]
categorical = ['feature10', 'feature11', ...]
```

This could be further expanded by using more data categories. See [7 Data Types: A Better Way to Think about Data Types for Machine Learning](https://towardsdatascience.com/7-data-types-a-better-way-to-think-about-data-types-for-machine-learning-939fae99a689).
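As an alternative to typing the two lists by hand, the column dtypes can provide a first guess. The sketch below assumes a hypothetical toy frame; note that integer-coded categories would land in `numerical`, so the result still needs a manual review.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two numeric columns and one text column.
toy_df = pd.DataFrame({
    "feature1": [1, 2, 3],
    "feature2": [0.5, 1.5, 2.5],
    "feature10": ["a", "b", "a"],
})

# Columns with a numeric dtype.
numerical = toy_df.select_dtypes(include=[np.number]).columns.tolist()

# Everything else is treated as categorical in this sketch.
categorical = toy_df.select_dtypes(exclude=[np.number]).columns.tolist()

print(numerical, categorical)
```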

### Summarize numerical and categorical variables

```python
# Summarize numerical
train_features.describe(include=[np.number])

# Summarize categorical (objects)
train_features.describe(include=['O'])
```

## Explore the variables

### Visualize distribution of target variable

```python
# Boxplot
plt.boxplot(train_target)

# Displot
sns.displot(train_target)
```

### Use of IQR rule to detect potential outliers

```python
# train_target is a one-column DataFrame, so select the column to get a Series
target_stats = train_target[target[0]].describe()

# Extract upper and lower bounds
IQR = target_stats['75%'] - target_stats['25%']
upper = target_stats['75%'] + 1.5 * IQR
lower = target_stats['25%'] - 1.5 * IQR
```

### Slice the dataframe to explore potential outliers

Use the upper and lower bound to extract outliers.

```python
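target_values = train_target[target[0]]  # target column as a Series

train_df[target_values < lower]  # rows below the lower bound
train_df[target_values > upper]  # rows above the upper bound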
```

This process is iterative: it continues as new discoveries about the variables are made.
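To make the IQR rule concrete, here is a small worked sketch on a hypothetical target series in which one value is clearly extreme. The data and the name `toy_target` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical target values; 100 is an obvious outlier.
toy_target = pd.Series([10, 12, 11, 13, 12, 11, 100], name="feature_target")

stats = toy_target.describe()

# Interquartile range and the usual 1.5 * IQR fences.
iqr = stats["75%"] - stats["25%"]
upper = stats["75%"] + 1.5 * iqr
lower = stats["25%"] - 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = toy_target[(toy_target < lower) | (toy_target > upper)]

print(lower, upper)
print(outliers)
```

With these values the quartiles are 11 and 12.5, so the fences are 8.75 and 14.75 and only the value 100 is flagged. Flagged rows are candidates for inspection, not automatic removal.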

## Plot the variables 

Use different visualizations to explore and get to know the dataset, ideally inside a function that generalizes the process, as follows:

```python
def plot_feature(df, col):
    '''
    Make a plot for each feature:
    left, the distribution of samples on the feature;
    right, the dependence of the target on the feature
    '''
    plt.figure(figsize = (14, 6))
    plt.subplot(1, 2, 1)
    if df[col].dtype == 'int64':
        pass
    else:
        pass
    
    (...)
```
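The function above leaves its body elided. One possible way to fill it in is sketched below, using only matplotlib and pandas. The name `plot_feature_sketch`, the extra `target_col` parameter, the choice of plots, and the toy data are all assumptions, not the original implementation.

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.api.types import is_numeric_dtype


def plot_feature_sketch(df, col, target_col):
    """Hypothetical completion: feature distribution (left), target vs. feature (right)."""
    fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(14, 6))

    if is_numeric_dtype(df[col]):
        # Numerical feature: histogram, then scatter against the target.
        ax_left.hist(df[col].dropna(), bins=20)
        ax_right.scatter(df[col], df[target_col], alpha=0.5)
    else:
        # Categorical feature: counts per category, then mean target per category.
        df[col].value_counts().plot(kind="bar", ax=ax_left)
        df.groupby(col)[target_col].mean().plot(kind="bar", ax=ax_right)

    ax_left.set_title(f"Distribution of {col}")
    ax_right.set_title(f"{target_col} vs. {col}")
    ax_right.set_ylabel(target_col)
    fig.tight_layout()
    return fig


# Hypothetical data with one numerical and one categorical feature.
toy_df = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 6],
    "feature10": ["a", "b", "a", "c", "b", "a"],
    "feature_target": [10.0, 12.0, 11.5, 15.0, 13.0, 12.5],
})

fig_num = plot_feature_sketch(toy_df, "feature1", "feature_target")
fig_cat = plot_feature_sketch(toy_df, "feature10", "feature_target")
```

Returning the figure instead of calling `plt.show()` keeps the helper usable both in notebooks and in scripts that save plots to disk.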

## Explore variables correlation

This step helps identify correlated variables, which may allow us to reduce the dimensionality of the dataset to be modeled. Some steps are:

### Correlation matrix

```python
plt.matshow(train_features[numerical].corr(), interpolation='none')
```
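A plotted matrix is hard to read once there are many variables, so it can help to list the strongly correlated pairs explicitly. The sketch below is one way to do that; the toy data and the 0.9 threshold are assumptions to be tuned per dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: feature2 is roughly 2 * feature1, feature3 is weakly related.
toy_df = pd.DataFrame({
    "feature1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "feature3": [5.0, 1.0, 4.0, 2.0, 3.0],
})

# Absolute correlations, since strong negative correlation is also redundancy.
corr = toy_df.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper_triangle = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off
high_pairs = upper_triangle.stack()
high_pairs = high_pairs[high_pairs > threshold]

print(high_pairs)
```

Each pair that survives the filter is a candidate for dropping or combining one of its two variables.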

### MDS plot using PCA

```python
from sklearn.decomposition import PCA

# Get correlation matrix
similarities = train_features[numerical].corr()

# Instantiate PCA
pca = PCA(n_components=2)
corr_pca = pca.fit_transform(similarities)

# Plot each variable at its position in the 2D projection
_, ax = plt.subplots(1, 1)

sns.scatterplot(x=corr_pca[:,0], y=corr_pca[:,1], ax=ax)

for i, txt in enumerate(train_features[numerical].columns):
    ax.annotate(txt, (corr_pca[i,0], corr_pca[i,1]))
```