# Part 1: Data Preprocessing

## 1. Data cleansing

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

First, we need to import the data:

In [None]:
path = "chimera_data_not_cleaned.csv"
df = pd.read_csv(path, sep = ",")

Plot a histogram of the `age` column of the dataset. Do the values look reasonable to you? Do the same for `salary`.

Use the function `.unique()` to take a look at the values taken on by `education`, `years_since_promotion`, and `exit`.

Use the function `.duplicated()` and conditioning to detect whether there are any duplicates.

Use the function `.drop_duplicates(inplace=True)` to remove the duplicates. Note that `inplace=True` modifies `df` directly rather than returning a new dataset.
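
As a minimal sketch, duplicate detection and removal could look as follows on a small made-up frame. The name `toy` and its values are hypothetical and are not taken from the Chimera data.

```python
import pandas as pd

# Hypothetical toy data: the second and fourth rows are identical.
toy = pd.DataFrame({"age": [25, 31, 40, 31],
                    "salary": [50000, 62000, 58000, 62000]})

# .duplicated() returns a boolean Series that is True for each repeat of an earlier row.
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# Conditioning on the mask displays the repeated rows themselves.
duplicate_rows = toy[dup_mask]

# Without inplace=True, drop_duplicates returns a new DataFrame and leaves toy untouched.
toy_unique = toy.drop_duplicates()
print(n_duplicates, toy_unique.shape)
```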

Use the function `.isna()` and conditioning to detect if there are any empty cells.

If we want to drop entire rows with NA values, we can simply use
`df.dropna(axis=0, how="any")`, which returns a new DataFrame and leaves `df` unchanged unless the result is assigned back.
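
The sketch below shows one possible way to combine `.isna()` with conditioning, again on a hypothetical frame called `toy` whose values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing cells.
toy = pd.DataFrame({"age": [25.0, np.nan, 40.0],
                    "salary": [50000.0, 62000.0, np.nan]})

# Count the missing cells in each column.
na_per_column = toy.isna().sum()

# Keep only the rows that contain at least one missing cell.
rows_with_na = toy[toy.isna().any(axis=1)]

# dropna returns a new DataFrame unless inplace=True is passed.
toy_complete = toy.dropna(axis=0, how="any")
print(na_per_column)
print(toy_complete.shape)
```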

Use the function `.nunique()` to find the number of unique entries for all columns.

Drop any columns with only one value using `.drop(columns=['name1','name2'])`.
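
A possible way to chain `.nunique()` and `.drop()` is sketched below. The frame `toy` and the column `country` are hypothetical and serve only to illustrate the idea.

```python
import pandas as pd

# Hypothetical toy data: "country" takes a single value.
toy = pd.DataFrame({"age": [25, 31, 40],
                    "country": ["CH", "CH", "CH"]})

# Number of distinct values in each column.
counts = toy.nunique()

# Collect the names of the columns that have exactly one distinct value.
constant_cols = counts[counts == 1].index.tolist()

# drop returns a new DataFrame without those columns.
toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols, list(toy_reduced.columns))
```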

## 2. Scaling/Normalization

In [None]:
from sklearn import preprocessing

In [None]:
X = np.array([[ 1., -1.,  2.],
                [ 2.,  0.,  0.],
                [ 0.,  1., -1.]])

Does the code given next normalize or scale the data?

In [None]:
min_max_scaler = preprocessing.MinMaxScaler()
X_minmax = min_max_scaler.fit_transform(X)
X_minmax

Does the code given next normalize or scale the data?

In [None]:
X_scaled = preprocessing.scale(X)
X_scaled
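
One way to approach both questions is to inspect column-wise statistics of each output. The sketch below recomputes the two transforms on the same `X` and prints the per-column range of the first and the per-column mean and standard deviation of the second.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

X_minmax = preprocessing.MinMaxScaler().fit_transform(X)
X_scaled = preprocessing.scale(X)

# Column-wise minimum and maximum of the MinMaxScaler output.
print(X_minmax.min(axis=0), X_minmax.max(axis=0))

# Column-wise mean and standard deviation of the scale() output.
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```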

## 3. Data imputation 

This can be done in Python using the `impute` module of scikit-learn. See the documentation at https://scikit-learn.org/stable/modules/impute.html

In [None]:
from sklearn import impute

For example, we can replace missing values by the mean:

In [None]:
X = np.array([[ 1., np.nan,  2.],
                [ 2.,  0.,  np.nan],
                [ 0.,  1., -1.]])

In [None]:
imp = impute.SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X)
print(imp.transform(X))

In [None]:
imp = impute.SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(df)
dfnew = pd.DataFrame(imp.transform(df))
dfnew.columns = df.columns
print(dfnew)

Do you see why this could be a problem? Try out the following:

In [None]:
dfnew['exit'].unique()
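
The same effect can be reproduced on a small made-up frame. In this sketch the frame `toy` and its values are hypothetical, and `exit` is assumed to be a 0/1 label.

```python
import numpy as np
import pandas as pd
from sklearn import impute

# Hypothetical toy data: "exit" is a binary label with one missing entry.
toy = pd.DataFrame({"salary": [50000.0, 62000.0, np.nan, 58000.0],
                    "exit": [0.0, 1.0, 1.0, np.nan]})

imp = impute.SimpleImputer(missing_values=np.nan, strategy="mean")
toy_imputed = pd.DataFrame(imp.fit_transform(toy), columns=toy.columns)

# The missing label is filled with the column mean, which is neither 0 nor 1.
print(toy_imputed["exit"].unique())
```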

For now, we will simply remove all rows with missing values.

In [None]:
print(df.shape)
df.dropna(axis=0,inplace=True)
print(df.shape)

## 4. Outliers

Using `seaborn`, plot a boxplot of `salary` as `exit` varies. Are there any outliers?

Using the function `stats.zscore(df[Column])`, compute the z-scores of `salary`. Are there any outliers?

In [None]:
from scipy import stats



What data structure is obtained here? Find the index of the outlier in this case using `np.where` and logical conditions. Do the two indexes correspond?
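
A minimal sketch of the z-score approach on invented data is given below. The frame `toy`, its values, and the cutoff of 3 (a common rule of thumb) are assumptions made for illustration. The result is wrapped in `np.asarray` so that the sketch does not depend on the exact type returned by `stats.zscore`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy salaries: twenty values near 50 and one value far away.
toy = pd.DataFrame({"salary": [48.0, 49.0, 50.0, 51.0, 52.0] * 4 + [500.0]})

# Convert to a plain NumPy array so that np.where returns positional indexes.
z = np.asarray(stats.zscore(toy["salary"]))

# Assumed cutoff: flag observations more than 3 standard deviations from the mean.
outlier_idx = np.where(np.abs(z) > 3)[0]
print(outlier_idx)
print(toy.iloc[outlier_idx])
```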

Let's now look at `boss_survey` as `exit` varies:

Use the function `np.nanquantile(column,quantile)` to find the 5% quantile of `boss_survey` results within the employees exiting the firm. Then, take a look at all of the employees leaving the firm who have a `boss_survey` result at or below this 5% quantile.
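
The two steps can be sketched as follows on a made-up frame. The values are invented, and the coding of `exit` as 1 for employees who leave is an assumption that should be checked against the actual values of the column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; exit == 1 is assumed to mark employees leaving the firm.
toy = pd.DataFrame({"exit": [1, 1, 1, 1, 0, 0, 1],
                    "boss_survey": [0.2, 0.9, np.nan, 0.7, 0.1, 0.8, 0.5]})

# Restrict to the employees who leave.
leavers = toy[toy["exit"] == 1]

# nanquantile ignores missing values when computing the quantile.
q05 = np.nanquantile(leavers["boss_survey"], 0.05)

# Leavers whose survey result is at or below the 5% quantile.
low_leavers = leavers[leavers["boss_survey"] <= q05]
print(q05)
print(low_leavers)
```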

# Part 2: Feature Engineering

## 1. Categorical to Numerical
We start with ordinal encoding, then move on to one-hot encoding.

1. Ordinal encoding

In [None]:
from sklearn.preprocessing import OrdinalEncoder

dftest = pd.DataFrame({'size':['small','medium','large','small','large','medium']})
dftest

In [None]:
encoder = OrdinalEncoder(categories=[['small','medium','large']]) 
dfnew = encoder.fit_transform(dftest) # transform data
pd.DataFrame(data=dfnew, columns=dftest.columns)

2. One-hot encoding

In [None]:
dftest = pd.DataFrame({'color':['green','red','red','blue','green','red']})
dftest

In [None]:
pd.get_dummies(dftest,drop_first=True,columns=['color'])

## 2. Feature transforms

In [None]:
dfcopy = df.copy()

Use `.apply(np.log)` to replace the `salary` column of `dfcopy` with its logarithm.

Print both `df` and `dfcopy`. Check their `salary` column. Has your transform had the effect you wanted?
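
A point that is easy to miss is that `.apply` returns a new Series instead of changing the column in place. The sketch below illustrates this on hypothetical frames `toy` and `toy_copy` with invented values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data and an independent copy of it.
toy = pd.DataFrame({"salary": [1.0, 10.0, 100.0]})
toy_copy = toy.copy()

# apply returns a new Series, so the result is assigned back to the column.
toy_copy["salary"] = toy_copy["salary"].apply(np.log)

# The original frame is unchanged, while the copy holds the logarithms.
print(toy["salary"].tolist())
print(toy_copy["salary"].tolist())
```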

We can also make a graphical comparison:

Create a new feature in `dfcopy` called `boss_tenure_percentage`, defined as `boss_tenure` / `tenure`. Then plot its histogram using `seaborn`.

# Part 3: Exercises

## Exercise 1: Iceberg right ahead!

Next term, we will use the Titanic dataset, available at https://www.kaggle.com/c/titanic/data 
This is historical data on the passengers aboard the Titanic. It contains some of their features (e.g., whether they had family on board, their cabin numbers) and whether or not they survived the sinking.
Our goal is to clean up this dataset so that we can use it later on. The dataset we will be cleaning up is `titanic_train.csv`.

In [None]:
titanic_train=pd.read_csv("titanic_train.csv")

1. Observe the header of the dataset. What do SibSp and Parch represent?

2. Are there any duplicates in the dataset? Why are we doing this before dropping any columns?

3. Drop three columns from the dataset: PassengerId, Name, and Ticket Number. We drop Ticket Number and PassengerId as they don't have much informative value. The Name could have some information in it (e.g., nobility, married or not, etc.) but extracting it would require natural language processing, which we will not use on the dataset.

4. Let's check for inconsistencies in the numerical data using `.hist()`. Does anything seem abnormal to you?

There may be some outliers but it seems like the values obtained are coherent.

5. What values do the categorical variables (this includes the cabin number) take on? Is there anything irregular there? Make sure you understand their output.

6. We now deal with the missing values. Which features are missing information?

7. What percentage of the Age/Cabin/Embarked values is missing? Use `.shape` to find this. Consequently, what should you do with the Cabin column?

8. For the Age column, we use a nearest neighbor approach. Use `KNNImputer` to fill in the missing values. Check that there are no more missing values in the Age column.

9. For the Embarked column, use `countplot` in the `seaborn` package to obtain the number of people who embarked at `S`, `C`, and `Q`. Where did the overwhelming majority of passengers embark? Use `SimpleImputer` to simply replace all missing values with the most frequent one. Check that no more entries are missing.

10. The division into Parch and SibSp is somewhat arbitrary. We merge these two columns into a single column called `Family_Presence`. Create a new column in the dataframe called `Family_Presence` which contains 1 if either SibSp is equal to 1 or Parch is equal to 1, and 0 otherwise. Then drop `Parch` and `SibSp`. Hint: Use `np.where(condition,1,0)` where `condition` is the logical condition that must be satisfied.

11. Finally, replace all categorical variables by numerical ones. We are ready to go!
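
Steps 8 to 10 above can be sketched on a small made-up frame. Everything in `toy` is invented and only mimics a few Titanic column names. The choice of `n_neighbors=2` and the use of `Fare` as a helper column for the nearest-neighbor search are assumptions, not prescriptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy frame mimicking a few Titanic columns; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "SibSp": [1, 1, 0, 0, 0],
    "Parch": [0, 0, 0, 1, 0],
    "Embarked": ["S", "C", "S", np.nan, "S"],
})

# Step 8: KNNImputer only handles numerical columns; n_neighbors=2 is an arbitrary choice.
num_cols = ["Age", "Fare", "SibSp", "Parch"]
knn = KNNImputer(n_neighbors=2)
toy[num_cols] = knn.fit_transform(toy[num_cols])

# Step 9: the most_frequent strategy also works on string columns.
mode_imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
toy["Embarked"] = mode_imp.fit_transform(toy[["Embarked"]]).ravel()

# Step 10: flag rows where SibSp or Parch equals 1, then drop both columns.
toy["Family_Presence"] = np.where((toy["SibSp"] == 1) | (toy["Parch"] == 1), 1, 0)
toy = toy.drop(columns=["SibSp", "Parch"])

# Check that no entries are missing any more.
print(toy.isna().sum())
```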

## Exercise 2:

Recall the notion of scaling and consider a feature for which we have many observations.

1. Show that if we take the feature vector, subtract its mean and divide by its standard deviation, then the new feature vector obtained is scaled, i.e., has mean 0 and standard deviation 1.

2. Check your answer on the first column of the np.array X below:

In [None]:
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
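
A numerical check could be sketched as follows. Note that `np.std` computes the population standard deviation (`ddof=0`) by default, which is the convention assumed here.

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# First column of X.
col = X[:, 0]

# Subtract the mean and divide by the (population) standard deviation.
col_scaled = (col - col.mean()) / col.std()

# Mean and standard deviation of the transformed column.
print(col_scaled.mean(), col_scaled.std())
```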

## Exercise 3: Motorcycle Helmets with Bluetooth

See the exercise description within Moodle!

1. We first create the two dataframes.

2. We now plot both dataframes on the same graph using seaborn.

3. We add a new column to each dataframe by taking the log-transform of the supply/demand.

4. We can estimate the slopes and intercepts quite easily: slope = rise / run and intercept = y - slope * x, where (x, y) is any point on the line.

5. We can now solve for when they cross:

6.a. We construct two lists.

6.b. 

6.c. 
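
For steps 4 and 5 above, the arithmetic can be sketched as follows. The exercise data lives in Moodle, so every number here is invented, and the helper `line_through` is a hypothetical name introduced for illustration only.

```python
# Hypothetical log-transformed observations: two price points per curve.
# These numbers are invented and do not come from the exercise.
p1, p2 = 10.0, 20.0
log_supply = (2.0, 4.0)  # log-supply at p1 and p2, rising with price
log_demand = (5.0, 3.0)  # log-demand at p1 and p2, falling with price


def line_through(x1, y1, x2, y2):
    """Return the slope (rise over run) and intercept of the line through two points."""
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


s_slope, s_int = line_through(p1, log_supply[0], p2, log_supply[1])
d_slope, d_int = line_through(p1, log_demand[0], p2, log_demand[1])

# The lines cross where s_slope * p + s_int equals d_slope * p + d_int.
p_cross = (d_int - s_int) / (s_slope - d_slope)
log_q_cross = s_slope * p_cross + s_int
print(p_cross, log_q_cross)
```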