# Introduction to Data Science and Machine Learning

<p align="center">
    <img width="699" alt="image" src="https://user-images.githubusercontent.com/49638680/159042792-8510fbd1-c4ac-4a48-8320-bc6c1a49cdae.png">
</p>

---

## Data Encoding

In many practical Data Science activities, the data set does not contain only numerical variables. And even when a variable is numeric, not all of its values carry a quantitative meaning.

Consider the case where we assign $0$ or $1$ to a _flag_ variable, meaning that a property is absent or present, with no intermediate value and no special ordering between $0$ and $1$.
We have also already encountered _categorical_ variables. These variables are typically stored as text values which represent various traits. Some examples include color (“Red”, “Yellow”, “Blue”), size (“Small”, “Medium”, “Large”) or geographic designations (State or Country).

Regardless of what the value is used for, the challenge is determining how to use this data in the analysis. Many machine learning algorithms can support categorical values without further manipulation but there are many more algorithms that do not. Therefore, we are faced with the challenge of figuring out how to turn these text attributes into numerical values for further processing.

We start by studying how Python (and pandas) represents these objects and how we can deal with them.

As with many other aspects of the Data Science world, there is no single answer on how to approach this problem.
Each approach has trade-offs and a potential impact on the outcome of the analysis.
Fortunately, the Python libraries pandas and scikit-learn provide several approaches that can be applied to transform categorical data into suitable numeric values.

In [1]:
# Import libraries
import pandas as pd
import numpy as np

# scikit-learn variables encoding
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, MinMaxScaler, LabelEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import SimpleImputer

# show sklearn objects in diagrams
from sklearn import set_config
set_config(print_changed_only=False, display="diagram")

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = "retina"

# set font and plot size to be larger
plt.rcParams.update({'font.size': 20, 'figure.figsize': (20, 13)})

### Data types in pandas

Let's start with an old friend: `pd.DataFrame.info()`.

In [8]:
data_url = 'https://raw.githubusercontent.com/fbagattini/Lezioni/master/data/OPSD_Germany_consumption.csv'
df = pd.read_csv(data_url)
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4383 entries, 0 to 4382
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         4383 non-null   object 
 1   Consumption  4383 non-null   float64
dtypes: float64(1), object(1)
memory usage: 321.1 KB


As you can see, there is a `float64` column (the computer representation of real numbers) and the `Date` column of type `object`, which for pandas here means strings.

#### The dataset

![img](https://raw.githubusercontent.com/fbagattini/Lezioni/master/img/OPSD.png)

The [OPSD database](https://open-power-system-data.org/) is provided by the well-known open source project Open Power System Data (OPSD). It offers data on installed generation capacity by country/technology, individual power plants (conventional and renewable), and time series data. The latter includes electricity consumption, spot prices, and wind and solar generation, both measured and derived from weather models.

The aim of this very cool project is to make life easier for power system modelers. 
The goal is to help researchers focus on research and avoid redundant work when collecting, preparing, and aggregating data.

In particular, in this lecture we are going to import the `OPSD_Germany_consumption` dataset. This reports Germany's electric energy consumption (in GWh) from $2006$ to $2017$. These data are part of the OPSD project.

Let's look at the first records of the dataframe.

In [9]:
# Let's look at the first data
df.head()

Unnamed: 0,Date,Consumption
0,2006-01-01,1069.184
1,2006-01-02,1380.521
2,2006-01-03,1442.533
3,2006-01-04,1457.217
4,2006-01-05,1477.131


One can see the type of a column also by the attribute `dtype`.

In [10]:
df['Date'].dtype

dtype('O')

In pandas, type `'O'` stands for `object`, a generic type. For our purposes we can think of it as a string.

In [11]:
df['Date'].loc[0]

'2006-01-01'

#### Time and timestamp

We want to give these data a structure, meaning that we want to be able to define an operation like

```python
df['Date'].loc[0] + year
```

having '2007-01-01' as result.

Hence, we aim at providing these records with a temporal structure.

We are going to explore the pandas function that performs this conversion: `to_datetime`. It returns a new type of object: `Timestamp`.

In [12]:
pd.to_datetime('10-02-09')

Timestamp('2009-10-02 00:00:00')

Note how we got not only the date, but also the time for free.

The function has attempted to infer the date format on its own: October 2, 2009. Through optional arguments we can steer this conversion.

Can you tell the difference between the two following cells?

In [13]:
pd.to_datetime('10-02-09', yearfirst=True)

Timestamp('2010-02-09 00:00:00')

In [14]:
pd.to_datetime('10-02-09', dayfirst=True)

Timestamp('2009-02-10 00:00:00')
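
When the layout of the strings is known in advance, another option is to state it explicitly rather than relying on inference. The following is a minimal sketch using the `format` argument of `to_datetime`; the variable name `ts` is only illustrative.

```python
import pandas as pd

# %d = day of month, %m = month, %y = two-digit year
ts = pd.to_datetime("10-02-09", format="%d-%m-%y")

print(ts)
```

With an explicit format the string is read as day, month and year, so the result agrees with the `dayfirst=True` call above.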

The `to_datetime` method can handle several formats.

In [15]:
pd.to_datetime('7th of June 1990')

Timestamp('1990-06-07 00:00:00')

In [16]:
pd.to_datetime('Feb 10 1990')

Timestamp('1990-02-10 00:00:00')

As with individual strings, we can convert lists of strings; in our instance, we can transform the whole `Date` column.

In [17]:
df.Date = pd.to_datetime(df.Date)

In [18]:
df

Unnamed: 0,Date,Consumption
0,2006-01-01,1069.18400
1,2006-01-02,1380.52100
2,2006-01-03,1442.53300
3,2006-01-04,1457.21700
4,2006-01-05,1477.13100
...,...,...
4378,2017-12-27,1263.94091
4379,2017-12-28,1299.86398
4380,2017-12-29,1295.08753
4381,2017-12-30,1215.44897


In [19]:
df.dtypes

Date           datetime64[ns]
Consumption           float64
dtype: object

It is noteworthy that pandas stores temporal objects as 64-bit integers, which allows it to handle nanosecond (ns) precision.

**Recap**: we have transformed a column of the dataframe into a special one, equipped with temporal logic.
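
As a small illustration of this temporal logic, the `+ year` operation sketched at the beginning of the section can be written with `pd.DateOffset`. This is only a sketch on three dates copied from the dataset; the names `dates`, `one_year_later` and `shifted` are illustrative.

```python
import pandas as pd

# A few dates from the dataset, converted from strings to datetimes
dates = pd.to_datetime(pd.Series(["2006-01-01", "2006-01-02", "2006-01-03"]))

# Shift a single Timestamp by one calendar year
one_year_later = dates.iloc[0] + pd.DateOffset(years=1)

# The same shift applied to the whole column at once
shifted = dates + pd.DateOffset(years=1)

print(one_year_later)
print(shifted)
```

The same expression applied to the original strings would fail, since a string carries no notion of calendar years.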

### DateTimeIndex

As you might have noticed, we have one example per day (this is called _frequency_). We will talk later about the definition of a time series, _i.e._ a _realisation_ of a stochastic process thanks to its projection on a time interval.

$$\{x_1, \ldots , x_n \vert x_k = x(t_k) \} \, .$$

For now, it might appear quite natural to use the time variable (the `Date` in this case) as an index. Pandas allows us to do so.

In [21]:
df = df.set_index('Date')

Now `df` is correctly indexed by `Date`. Let's have a look at the new index.

In [22]:
df.index

DatetimeIndex(['2006-01-01', '2006-01-02', '2006-01-03', '2006-01-04',
               '2006-01-05', '2006-01-06', '2006-01-07', '2006-01-08',
               '2006-01-09', '2006-01-10',
               ...
               '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
               '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
               '2017-12-30', '2017-12-31'],
              dtype='datetime64[ns]', name='Date', length=4383, freq=None)

It is not the usual `Index` but something more specific: a `DatetimeIndex`. Indeed, it has dedicated properties.

In [23]:
df.index.year

Int64Index([2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006,
            ...
            2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017],
           dtype='int64', name='Date', length=4383)

In [24]:
df.index.day

Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
            ...
            22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
           dtype='int64', name='Date', length=4383)

**Recap**: we have used the time information provided by our data (the date on which each record is observed) to index a dataframe.
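
The index attributes seen above can be used directly to filter or group records. The sketch below works on a toy dataframe with made-up values (not the OPSD data); `toy`, `toy_2017` and `yearly_mean` are illustrative names.

```python
import pandas as pd

# Toy daily series spanning two years (values are made up for illustration)
index = pd.date_range("2016-12-30", "2017-01-02", freq="D")
toy = pd.DataFrame({"Consumption": [10.0, 20.0, 30.0, 50.0]}, index=index)

# Boolean mask built from an attribute of the DatetimeIndex
toy_2017 = toy[toy.index.year == 2017]

# Average consumption per year
yearly_mean = toy.groupby(toy.index.year)["Consumption"].mean()

print(toy_2017)
print(yearly_mean)
```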

### Indexing at loading time

Since the greatest force moving a developer is laziness, we do not want to rework our dataframes every time in order to index them correctly. In other words, we do not want to repeat the operations we have done above, _i.e._ converting to the right datetime format, creating a copy of our dataframe, changing the index to a `DatetimeIndex`, etc.

We can achieve the same goal in a compact way. Among the arguments of `read_csv`, there are:

* `parse_dates` 
* `index_col`

The former accepts, among other options, a boolean (`True` or `False`) that sets whether pandas should try to parse the index as dates. The latter accepts a column label (or a list of labels) indicating which columns should be used as the index.

<p align="center">
    <img width="500" src="https://media.makeameme.org/created/ta-da.jpg">
</p>

In [25]:
df = pd.read_csv(data_url,
                 parse_dates=True,
                 index_col='Date')
df.head()

Unnamed: 0_level_0,Consumption
Date,Unnamed: 1_level_1
2006-01-01,1069.184
2006-01-02,1380.521
2006-01-03,1442.533
2006-01-04,1457.217
2006-01-05,1477.131


**Recap**: with a single call to `read_csv` we have loaded the OPSD_Germany_consumption dataset and set the `Date` column as its time index.
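
To try this without downloading anything, here is a self-contained sketch that reads two records (copied from the table above) from an in-memory string. Passing the column name to `parse_dates` is an alternative to `parse_dates=True`; `csv_text` and `toy` are illustrative names.

```python
import io

import pandas as pd

# Two records copied from the dataset, stored in an in-memory CSV
csv_text = "Date,Consumption\n2006-01-01,1069.184\n2006-01-02,1380.521\n"

toy = pd.read_csv(io.StringIO(csv_text),
                  parse_dates=["Date"],  # parse this column as dates
                  index_col="Date")      # and use it as the index

print(type(toy.index).__name__)
print(toy)
```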

### String and partial-string indexing

Now that records have a time reference attached, we can easily access them through string indexing.

Let's suppose we want to know the electric consumption of Christmas 2015.

In [None]:
df.loc['2015-12-25']

Consumption    1047.277
Name: 2015-12-25 00:00:00, dtype: float64

And we want to compare this record with the average consumption of December 2015. It is quite easy to do so in pandas.

As with slicing in Python and NumPy, string indexing can be used with ':' to access time intervals.

In [None]:
df.loc['2015-12-1':'2015-12-31']

Unnamed: 0_level_0,Consumption
Date,Unnamed: 1_level_1
2015-12-01,1588.021
2015-12-02,1585.308
2015-12-03,1577.457
2015-12-04,1570.318
2015-12-05,1337.095
2015-12-06,1232.073
2015-12-07,1536.251
2015-12-08,1572.74
2015-12-09,1586.393
2015-12-10,1596.593


**Important note**: A noteworthy difference between pandas label-based slicing and plain Python slicing is that in pandas both endpoints of the interval are included in the result.
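
The difference can be checked on a toy example. The series below contains made-up values and is only meant to compare the two slicing conventions; `s`, `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

# Five consecutive days with made-up values
idx = pd.date_range("2015-12-01", "2015-12-05", freq="D")
s = pd.Series([0, 1, 2, 3, 4], index=idx)

# Label-based slicing: both endpoints are included (3 records)
by_label = s.loc["2015-12-02":"2015-12-04"]

# Plain Python slicing: the right endpoint is excluded (2 elements)
by_position = [0, 1, 2, 3, 4][1:3]

print(len(by_label), len(by_position))
```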

Another temporal indexing technique is partial-string indexing. By specifying the month and year, we extract only the records of December 2015.

In [None]:
df.loc['2015-12']

Unnamed: 0_level_0,Consumption
Date,Unnamed: 1_level_1
2015-12-01,1588.021
2015-12-02,1585.308
2015-12-03,1577.457
2015-12-04,1570.318
2015-12-05,1337.095
2015-12-06,1232.073
2015-12-07,1536.251
2015-12-08,1572.74
2015-12-09,1586.393
2015-12-10,1596.593


And here is the average consumption of December 2015.

In [None]:
df.loc['2015-12'].mean()

Consumption    1375.545516
dtype: float64

**Recap**: we have learnt how to access the elements of a time indexed dataframe.

---

#### Exercises

Along with electric energy consumption, the [OPSD_Germany_all](https://raw.githubusercontent.com/fbagattini/Lezioni/master/data/OPSD_Germany_all.csv) dataset reports the daily production of solar and wind energy.

1. Load the dataset as a dataframe, using Date as time index. Solar energy production is not available until December 31, 2011; from the dataframe, select only the records following this date.
2. Create a column Renewable as the sum of solar and wind energy.
3. Compute the ratio between a) the total renewable production of September 2014 and b) the total electric consumption of the same month.
4. Using the indexing properties of Date (hint: `index.day_name()`), create the column 'Weekday' containing, for each record, the day of the week at which it's been observed (Monday, Tuesday,...).
5. Create the dataframe `df_sunday_wind` containing (only) Sunday wind production.
6. Compute the average Sunday wind production between January and March 2017 (included).
7. Use the tools we have seen so far to plot the consumption of electric energy.
8. Use the tools we have seen so far to plot the consumption of electric energy from November 2013 to June 2015.

---

### Categorical variables

Until now, we have faced only numerical values in our analysis, but it is very common to have categorical variables in the data. 
As mentioned, these variables are typically stored as text values which represent various traits. Some examples include color (“Red”, “Yellow”, “Blue”), size (“Small”, “Medium”, “Large”) or geographic designations (State or Country).

Let's read another dataset and try to work with this kind of variable.

In [27]:
df = pd.read_csv(filepath_or_buffer="datasets/categorical.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Gender     400 non-null    object
 1   Student    400 non-null    object
 2   Married    400 non-null    object
 3   Ethnicity  400 non-null    object
 4   Rating     400 non-null    object
dtypes: object(5)
memory usage: 15.8+ KB


As one can notice, we only have `object` values.

In [28]:
df.head()

Unnamed: 0,Gender,Student,Married,Ethnicity,Rating
0,Male,No,Yes,Caucasian,Low
1,Female,Yes,Yes,Asian,Premium
2,Male,No,No,Asian,Premium
3,Female,No,No,Asian,Premium
4,Male,No,Yes,Caucasian,Medium


To deal with categorical variables, we first have to inspect their meaning and structure. We can have:

* __binary variables__: can have exactly two values, or categories (typically _flags_)
* __polytomous variables__: have more than two possible categories; among them:
    * __nominal variables__: there is no intrinsic ordering to the categories
    * __ordinal variables__: there is a clear ordering of the categories

#### Binary variables
This kind of variable can be easily encoded by creating a [_dummy variable_](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)) for each category, which indicates with a value of zero or one the absence or presence of the attribute. This technique is called one hot encoding and is valid for binary and nominal variables.  
We can see a practical example using the variable `Gender` of the dataset, which can have only two values: _'Female'_ or _'Male'_.  

The objective is to make a transformation of the variable in order to obtain the following result:  


|    | Gender   |   Gender_Male |
|---:|:---------|--------------:|
|  0 | Male     |             1 |
|  1 | Female   |             0 |
|  2 | Male     |             1 |
|  3 | Female   |             0 |
|  4 | Male     |             1 |

Where the new column `Gender_Male` is valued $1$ when the record has the _masculinity_ property, $0$ otherwise.

We could in principle create a second column, called `Gender_Female`, but in our case (the dataset does not take non-binary genders into account) this would carry exactly the same information as `Gender_Male` (with $1$ and $0$ exchanged).

One can encode binary categorical variables using `scikit-learn`. 
Indeed, this kind of variable can be encoded using the [`OneHotEncoder` class](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

In order to do so, let's instantiate the object.

In [33]:
# Instance of the OneHotEncoder object
# Note: in scikit-learn >= 1.2 the `sparse` parameter is named `sparse_output`
ohe = OneHotEncoder(sparse=False, dtype=int)

The scikit-learn encoders need to be _fitted_ on data, _i.e._ they need to see the data in order to establish which values they have to encode; in this case, which values have to be replaced with $0$ and $1$.

With this operation, the encoder reads and sorts the unique values of the variable and creates a rule of conversion to numeric values.
Scikit-learn is quite flexible, meaning we can pass it many different array-like types. However, we suggest always passing numpy arrays, since in the end scikit-learn tries to convert its input into such a form.

In [37]:
# `values` returns the np.array underlying a pd.Series or pd.DataFrame;
# reshape(-1, 1) turns it into a single-column 2D array, as the encoder expects
gender_array = df.Gender.values.reshape(-1, 1)
ohe.fit(X=gender_array)

We have now instantiated the encoder and created the conversion rules. We have not applied the rules to anything yet. 
We use the method `transform` to encode the variable and assign the result to a new variable, `gender_enc`. 

The object returned by the method is a numpy array with $2$ columns, one for each of the female and male categories.

In [38]:
gender_enc = ohe.transform(X=gender_array)

print(f"gender encoded -> type object: {type(gender_enc)}, shape: {gender_enc.shape}")

gender encoded -> type object: <class 'numpy.ndarray'>, shape: (400, 2)


Of course, we can use the fitted encoder to obtain the original values from the encoded data.
To do this, we apply the method `inverse_transform` of the fitted encoder to the encoded data.

In [39]:
gender_inv = ohe.inverse_transform(X=gender_enc)

Naturally, now we want to compare original, encoded and inverted data.

The encoded variable contains two columns. Since the encoder sorts the categories, the first column indicates with $1$ or $0$ the presence or absence of the _Female_ value, while the second column does the same for the _Male_ category. Note that the printout below shows a single encoded column: judging by the cell execution counters, it was apparently produced after the encoder had been re-created with `drop="first"` (see the next section).

However, we know that one of the assumptions of the models studied so far is the lack of perfect [multicollinearity in the predictors](https://www.statology.org/perfect-multicollinearity/#:~:text=If%20the%20degree%20of%20correlation,exact%20linear%20relationship%20between%20them.). The encoded columns now have exactly this problem: the two columns always sum to $1$, so each one is completely determined by the other.

In [62]:
print("original \t encoded \t inverted")

for original, encoded, inverted in zip(gender_array[:10, :], gender_enc[:10, :], gender_inv[:10, :]):
    print(f"{original} \t {encoded} \t\t {inverted}")

original 	 encoded 	 inverted
['Male'] 	 [1] 		 ['Male']
['Female'] 	 [0] 		 ['Female']
['Male'] 	 [1] 		 ['Male']
['Female'] 	 [0] 		 ['Female']
['Male'] 	 [1] 		 ['Male']
['Male'] 	 [1] 		 ['Male']
['Female'] 	 [0] 		 ['Female']
['Male'] 	 [1] 		 ['Male']
['Female'] 	 [0] 		 ['Female']
['Female'] 	 [0] 		 ['Female']


##### Multicollinearity issue
How can we deal with the multicollinearity problem?

It is quite easy: we can drop one of the two columns created by the encoder. 
To do so in an easy way, we can instantiate the encoder setting the parameter `drop="first"`.

Performing this operation has two benefits:

1. we avoid the multicollinearity problem;
2. the resulting array has one column fewer.

As we can see from the operation below, the array contains only the dummy variable for the `Male` category.

In [43]:
ohe = OneHotEncoder(drop="first", sparse=False, dtype=int)
gender_enc = ohe.fit_transform(X=gender_array)
print(ohe.get_feature_names_out())
print(gender_enc[:10, :])

['x0_Male']
[[1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [0]
 [1]
 [0]
 [0]]
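
To see why dropping a column helps, one can compare the rank of the predictor matrix with and without the redundant dummy. The sketch below uses plain numpy on the first five `Gender` values and assumes a model with an intercept (a column of ones); `X_full` and `X_drop` are illustrative names.

```python
import numpy as np

# First five values of the Gender column
gender = np.array(["Male", "Female", "Male", "Female", "Male"])
female = (gender == "Female").astype(int)
male = (gender == "Male").astype(int)
intercept = np.ones(len(gender))

# Intercept plus both dummies: 3 columns, but intercept = female + male
X_full = np.column_stack([intercept, female, male])

# Intercept plus one dummy only: 2 linearly independent columns
X_drop = np.column_stack([intercept, male])

print(X_full.shape[1], np.linalg.matrix_rank(X_full))
print(X_drop.shape[1], np.linalg.matrix_rank(X_drop))
```

In the first matrix the rank is smaller than the number of columns, which is exactly the perfect multicollinearity we want to avoid.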


### Nominal variables

The nominal variables are treated with the same technique seen before for binary variables. The main difference is that (with `drop="first"`) the encoder creates $n-1$ dummy variables from the original variable with $n$ categories. 
We can see a practical example using the variable `Ethnicity` of the dataset, which can have only three values: 'African American', 'Asian' and 'Caucasian'.

The objective is to transform the variable in order to obtain the following result

|    | Ethnicity   |   Ethnicity_Asian |   Ethnicity_Caucasian |
|---:|:------------|------------------:|----------------------:|
|  0 | Caucasian   |                 0 |                     1 |
|  1 | Asian       |                 1 |                     0 |
|  2 | Asian       |                 1 |                     0 |
|  3 | Asian       |                 1 |                     0 |
|  4 | Caucasian   |                 0 |                     1 |

_Quick question_: How has the 'African American' category been encoded?

In [44]:
# Definition of the variable ethnicity containing the values to encode
ethnicity_array = df[["Ethnicity"]].values

We can now fit the encoder on the suitable values. The encoder has a method called `fit_transform` that, in a single line of code, fits the encoder and then applies the transformation rules to the data.

In [45]:
ohe = OneHotEncoder(drop="first", sparse=False, dtype=int)
ethnicity_enc = ohe.fit_transform(X=ethnicity_array)

As above, we want to compare original and encoded data.

As the results show, the encoded variable contains two columns: the first column indicates with $1$ or $0$ the presence or absence of the Asian value, while the second column does the same for the Caucasian category. The presence of the attribute African American can be deduced from the other two columns: in practice, it is present where both columns have a value of $0$.

In [49]:
print("original \t\t\t encoded")

for original, encoded in zip(ethnicity_array[:10, :], ethnicity_enc[:10, :]):
    print(f"{original} \t\t\t {encoded}")

original 			 encoded
['Caucasian'] 			 [0 1]
['Asian'] 			 [1 0]
['Asian'] 			 [1 0]
['Asian'] 			 [1 0]
['Caucasian'] 			 [0 1]
['Caucasian'] 			 [0 1]
['African American'] 			 [0 0]
['Asian'] 			 [1 0]
['Caucasian'] 			 [0 1]
['African American'] 			 [0 0]


### Encode binary and nominal categorical variables using pandas

It is possible to perform the same operations seen above using the pandas function `get_dummies`, passing the dataframe and the columns to transform. 
This function is very useful if we want to encode the categorical columns while keeping the pandas dataframe structure, but we have to remember that operating on the whole dataset could lead to data leakage.

Definition of the function:

```python 
pandas.get_dummies(
    data,
    prefix=None,
    prefix_sep='_',
    dummy_na=False,
    columns=None,
    sparse=False,
    drop_first=False,
    dtype=None,
) -> 'DataFrame'
```

Let's see it at work.

In [50]:
df_enc = pd.get_dummies(data=df, columns=["Gender", "Student", "Married", "Ethnicity"], drop_first=True)

df_enc

Unnamed: 0,Rating,Gender_Male,Student_Yes,Married_Yes,Ethnicity_Asian,Ethnicity_Caucasian
0,Low,1,0,1,0,1
1,Premium,0,1,1,1,0
2,Premium,1,0,0,1,0
3,Premium,0,0,0,1,0
4,Medium,1,0,1,0,1
...,...,...,...,...,...,...
395,Medium,1,0,1,0,1
396,Low,1,0,0,0,0
397,Medium,0,0,1,0,1
398,Very low,1,0,1,0,1
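
A related practical point is that `get_dummies` has no memory of the categories it has seen, whereas a fitted encoder does. The sketch below uses two tiny hypothetical dataframes, `train` and `test`, to show the difference; it illustrates the fit-then-transform workflow and is not part of the original analysis.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical split: the test set does not contain every category
train = pd.DataFrame({"Ethnicity": ["Asian", "Caucasian", "African American"]})
test = pd.DataFrame({"Ethnicity": ["Asian", "Asian"]})

# get_dummies only knows the values present in the frame it receives
train_dummies = pd.get_dummies(train, columns=["Ethnicity"])
test_dummies = pd.get_dummies(test, columns=["Ethnicity"])
print(train_dummies.shape[1], test_dummies.shape[1])

# An encoder fitted on the training data keeps the same columns afterwards
ohe = OneHotEncoder(dtype=int)
ohe.fit(train[["Ethnicity"]])
test_enc = ohe.transform(test[["Ethnicity"]]).toarray()
print(test_enc.shape)
```

The two `get_dummies` calls return a different number of columns, while the fitted encoder always returns one column per category learned during `fit`.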


### Sparsity of matrices

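You may have noticed how the number of columns is increasing. Try to imagine what would happen to a dataframe that originally has three categorical columns, each with $37$ categories: one hot encoding would produce $3 \times 37 = 111$ columns (or $3 \times 36 = 108$ when dropping the first category of each), most of whose entries are $0$.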

This is one of the main problems of One Hot Encoding.
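
The scenario just described can be simulated to get a feeling for the numbers. The sketch below builds random data with three categorical columns of $37$ categories each (the data and the names `X`, `enc`, `X_enc` and `density` are made up for illustration) and relies on the fact that `OneHotEncoder` returns a SciPy sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n_rows, n_cols, n_cats = 1000, 3, 37

# Random categorical data, stored as strings
X = rng.integers(0, n_cats, size=(n_rows, n_cols)).astype(str)
# Make sure every category appears at least once in each column
X[:n_cats, :] = np.arange(n_cats).astype(str)[:, None]

enc = OneHotEncoder()          # sparse output by default
X_enc = enc.fit_transform(X)

# Fraction of entries that are different from zero
density = X_enc.nnz / (X_enc.shape[0] * X_enc.shape[1])

print(X_enc.shape)
print(f"non-zero entries: {density:.1%}")
```

Each row contains exactly one $1$ per original column, so only $3$ entries out of $111$ are non-zero. Storing the result in a sparse format avoids keeping all those zeros in memory.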

### Ordinal variables

The ordinal variables are among the easiest to encode because they have a logical order that can be used to map the categories to numbers. 
The advantage of this kind of variable is that we can encode its categories in a single dimension without losing the ordering information.

The objective is to transform the variable in order to obtain the following result

|    | Rating   |   Rating Encoded |
|---:|:---------|-----------------:|
|  0 | Low      |                1 |
|  1 | Premium  |                4 |
|  2 | Premium  |                4 |
|  3 | Premium  |                4 |
|  4 | Medium   |                2 |

There exists a specific object in scikit-learn for this, called (with a lack of imagination) `OrdinalEncoder`.

In [51]:
# Define array of categorical variable
rating_array = df[["Rating"]].values

__Careful__: Check the order of the categories!

Since the values are stored as strings, they are sorted in alphabetical order by default. Here the alphabetical order of the categories does not follow the logical one.

In [53]:
print(np.unique(ar=rating_array))

['High' 'Low' 'Medium' 'Premium' 'Very low']
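
The effect of this ordering can be checked on a tiny example. The sketch below encodes the five rating levels once with the default (alphabetical) categories and once with an explicit order; `ratings`, `default_codes` and `ordered_codes` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# The five levels, listed from the lowest to the highest
levels = ["Very low", "Low", "Medium", "High", "Premium"]
ratings = np.array(levels).reshape(-1, 1)

# Default behaviour: categories are sorted alphabetically
default_codes = OrdinalEncoder(dtype=int).fit_transform(ratings).ravel()

# Explicit order: codes follow the logical ranking
ordered_codes = OrdinalEncoder(categories=[levels], dtype=int).fit_transform(ratings).ravel()

print(default_codes)
print(ordered_codes)
```

With the default settings 'Very low' receives the largest code and 'High' the smallest, which does not reflect the meaning of the categories.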


Hence, let's encode ordinal categorical variables using scikit-learn.

This kind of variable can be encoded using the [`OrdinalEncoder` class](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html). By default, it encodes the variable by sorting the categories alphabetically; this means that if the logical order of the categories differs from the alphabetical one, the encoding will be wrong. 
To encode correctly, we instantiate the encoder passing the categories in our custom order to the parameter `categories`.

In [54]:
# Define categories in the right order
categories = [['Very low', 'Low', 'Medium', 'High', 'Premium']]

oe = OrdinalEncoder(categories=categories, dtype=int)
oe

We are now ready to fit the encoder and transform the data.

In [55]:
rating_enc = oe.fit_transform(X=rating_array)

Let's have a look at the encoded categories. This also allows us to check whether the order is correct.

In [56]:
print(oe.categories_)

[array(['Very low', 'Low', 'Medium', 'High', 'Premium'], dtype=object)]


We can use the fitted encoder to obtain the original values from the encoded data.

As usual, to do this we apply the method `inverse_transform` of the fitted encoder to the encoded data.

In [57]:
rating_inv = oe.inverse_transform(rating_enc)

And again, let's compare original, encoded and inverted data.

In [59]:
print("original \t encoded \t inverted")

for original, encoded, inverted in zip(rating_array[:10, :], rating_enc[:10, :], rating_inv[:10, :]):
    print(f"{original} \t {encoded} \t\t {inverted}")

original 	 encoded 	 inverted
['Low'] 	 [1] 		 ['Low']
['Premium'] 	 [4] 		 ['Premium']
['Premium'] 	 [4] 		 ['Premium']
['Premium'] 	 [4] 		 ['Premium']
['Medium'] 	 [2] 		 ['Medium']
['Premium'] 	 [4] 		 ['Premium']
['Low'] 	 [1] 		 ['Low']
['Premium'] 	 [4] 		 ['Premium']
['Low'] 	 [1] 		 ['Low']
['Premium'] 	 [4] 		 ['Premium']


### Pandas encoding

Of course, there is also a way to encode ordinal variables in pandas, although it has no dedicated encoder object for ordinal categorical variables. 
A simple way to perform this operation is to create a conversion dictionary and apply the encoding to the variable with the `map` method.

In [63]:
map_rating = {
    'Very low': 0, 
    'Low': 1,
    'Medium': 2,
    'High': 3,
    'Premium': 4
}

df_enc["Rating"] = df_enc["Rating"].map(map_rating)

In [64]:
df_enc

Unnamed: 0,Rating,Gender_Male,Student_Yes,Married_Yes,Ethnicity_Asian,Ethnicity_Caucasian
0,1,1,0,1,0,1
1,4,0,1,1,1,0
2,4,1,0,0,1,0
3,4,0,0,0,1,0
4,2,1,0,1,0,1
...,...,...,...,...,...,...
395,2,1,0,1,0,1
396,1,1,0,0,0,0
397,2,0,0,1,0,1
398,0,1,0,1,0,1
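
As an alternative to a hand-written dictionary, pandas also offers an ordered categorical type whose integer codes follow the order we declare. The sketch below applies it to the first five `Rating` values; it is a possible variation, not the approach used in this lecture, and `order`, `cat` and `codes` are illustrative names.

```python
import pandas as pd

# First five values of the Rating column
ratings = pd.Series(["Low", "Premium", "Premium", "Premium", "Medium"])

# Logical order of the categories, from the lowest to the highest
order = ["Very low", "Low", "Medium", "High", "Premium"]

# Ordered categorical: each value is stored as the position of its category
cat = pd.Categorical(ratings, categories=order, ordered=True)
codes = pd.Series(cat.codes, index=ratings.index)

print(codes.tolist())
```

The codes coincide with those produced by the dictionary, since both follow the same order.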


#### Transform heterogeneous data using scikit-learn

When we have to encode different kinds of variables, it could be difficult to handle them all properly. 

For these situations `scikit-learn` provides two composers that allow us to transform different columns simultaneously, called [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html) and [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html). 

These objects accept every imputer, transformer and encoder (and other objects too) that has the methods `fit` and `transform`.  

One of the advantages of these composers is that they apply different transformers (also _Scalers_ or _Imputers_) to the columns of a numpy array or pandas DataFrame. 
For this example, we transform the dataframe columns directly, without converting the data to a numpy array.

In [65]:
# Import a new dataframe
df = pd.read_csv("datasets/heterogeneous_data.csv")

In [66]:
df.head(10)

Unnamed: 0,Gender,Student,Married,Ethnicity,Rating,Age,Income,Education Years
0,Male,No,Yes,Caucasian,Low,34,14.891,11
1,Female,Yes,Yes,Asian,Premium,82,106.025,15
2,Male,No,No,Asian,Premium,71,104.593,11
3,Female,No,No,Asian,Premium,36,148.924,11
4,Male,No,Yes,Caucasian,Medium,68,55.882,16
5,Male,No,No,Caucasian,Premium,77,80.18,10
6,Female,No,No,African American,Low,37,20.996,12
7,Male,No,No,Asian,Premium,87,71.408,9
8,Female,No,No,Caucasian,Low,66,15.125,13
9,Female,Yes,Yes,African American,Premium,41,71.061,19


#### Using `ColumnTransformer`

When defining the composer, in the parameter `transformers` we have to pass a list of tuples that includes, for each transformer, the name, the transformer object and the columns to transform.  

Definition of the object:

```python 
ColumnTransformer(
    transformers,
    remainder='drop',
    sparse_threshold=0.3,
    n_jobs=None,
    transformer_weights=None,
    verbose=False,
)
```

In [67]:
categories = [['Very low', 'Low', 'Medium', 'High', 'Premium']]

ct = ColumnTransformer(transformers=[
    ("MinMax", MinMaxScaler(), ["Age", "Income", "Education Years"]),
    ("OneHot", OneHotEncoder(drop="first", sparse=False, dtype=int), ["Gender", "Student", "Married", "Ethnicity"]),
    ("Ordinal", OrdinalEncoder(categories=categories, dtype=int), ["Rating"])
])
ct

##### Fit and transform of the categorical data

Like the other `scikit-learn` objects, the `fit_transform` method of the composer returns a numpy array.

In [69]:
array_enc = ct.fit_transform(df)

print(array_enc.shape)

(400, 9)


##### Using `make_column_transformer`  

The main difference between the two is that the function `make_column_transformer` takes the tuples directly as positional arguments and doesn't need the names of the composing transformers. 

Definition of the function:

```python
make_column_transformer(*transformers, **kwargs)
```

In [70]:
categories = [['Very low', 'Low', 'Medium', 'High', 'Premium']]

mct = make_column_transformer(
    (MinMaxScaler(), ["Age", "Income", "Education Years"]),
    (OneHotEncoder(drop="first", sparse=False, dtype=int), ["Gender", "Student", "Married", "Ethnicity"]),
    (OrdinalEncoder(categories=categories, dtype=int), ["Rating"])
)
mct

In [71]:
array_enc = mct.fit_transform(df)

print(array_enc.shape)

(400, 9)


#### How to interpret the coefficients of categorical variables

The logic behind the interpretation of categorical variables is very similar to that used for numerical variables.

1. The dropped category (the one for which all dummy variables are zero) is absorbed into the intercept.
2. The coefficient of each other category shows the variation in the prediction when the corresponding dummy variable is $1$, _i.e._ when the record belongs to that category rather than to the dropped one.
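
As a worked illustration, assume (hypothetically) a linear regression whose only predictors are the two `Ethnicity` dummies obtained with `drop="first"`, denoted here by $x_{\text{Asian}}$ and $x_{\text{Caucasian}}$:

$$y = \beta_0 + \beta_1 \, x_{\text{Asian}} + \beta_2 \, x_{\text{Caucasian}} + \varepsilon \, .$$

Under this assumption, the expected value of $y$ in each category is

$$\mathbb{E}[y \mid \text{African American}] = \beta_0 \, , \qquad \mathbb{E}[y \mid \text{Asian}] = \beta_0 + \beta_1 \, , \qquad \mathbb{E}[y \mid \text{Caucasian}] = \beta_0 + \beta_2 \, .$$

Hence $\beta_1$ and $\beta_2$ measure the difference with respect to the dropped category, whose level is captured by the intercept $\beta_0$.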

#### Bonus: Handle missing values

We have already seen how it is likely (depending on the problem) to have missing values in our data. 
Let's introduce some missing values in the columns `Age` and `Rating`.

In [72]:
# Add missing values randomly in 5% of the records.
for column in ["Age", "Rating"]:
    df.loc[df.sample(frac=0.05).index, column] = np.nan

Let's check for missing values in our dataframe.

In [73]:
df.isnull().sum()

Gender              0
Student             0
Married             0
Ethnicity           0
Rating             20
Age                20
Income              0
Education Years     0
dtype: int64

Recall that when there are missing values in the dataset, there are two strategies that can be followed:

1. drop the rows or columns with the missing values;
2. impute the missing values with another value.

1. To drop the rows that contain missing values we can simply use the method `dropna`. 

In [74]:
df.dropna()

Unnamed: 0,Gender,Student,Married,Ethnicity,Rating,Age,Income,Education Years
0,Male,No,Yes,Caucasian,Low,34.0,14.891,11
1,Female,Yes,Yes,Asian,Premium,82.0,106.025,15
2,Male,No,No,Asian,Premium,71.0,104.593,11
3,Female,No,No,Asian,Premium,36.0,148.924,11
4,Male,No,Yes,Caucasian,Medium,68.0,55.882,16
...,...,...,...,...,...,...,...,...
395,Male,No,Yes,Caucasian,Medium,32.0,12.096,13
396,Male,No,No,African American,Low,65.0,13.364,17
397,Female,No,Yes,Caucasian,Medium,67.0,57.872,12
398,Male,No,Yes,Caucasian,Very low,44.0,37.728,13


2. To impute the missing values in pandas we can use the method `fillna`.

In this case, we want to use the mode to impute the `Rating` column and the mean to impute the `Age` column.

In [75]:
rating_mode = df["Rating"].mode()[0]
age_mean = df["Age"].mean()

print(f"Rating mode: '{rating_mode}', Age mean: {age_mean:.4f}")

Rating mode: 'Low', Age mean: 55.8895


In [76]:
df.fillna({"Rating": rating_mode, "Age": age_mean})

Unnamed: 0,Gender,Student,Married,Ethnicity,Rating,Age,Income,Education Years
0,Male,No,Yes,Caucasian,Low,34.0,14.891,11
1,Female,Yes,Yes,Asian,Premium,82.0,106.025,15
2,Male,No,No,Asian,Premium,71.0,104.593,11
3,Female,No,No,Asian,Premium,36.0,148.924,11
4,Male,No,Yes,Caucasian,Medium,68.0,55.882,16
...,...,...,...,...,...,...,...,...
395,Male,No,Yes,Caucasian,Medium,32.0,12.096,13
396,Male,No,No,African American,Low,65.0,13.364,17
397,Female,No,Yes,Caucasian,Medium,67.0,57.872,12
398,Male,No,Yes,Caucasian,Very low,44.0,37.728,13
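
Keep in mind that `fillna` (like `dropna`) returns a new dataframe and leaves the original untouched unless the result is assigned. The toy example below, with made-up values and illustrative names `toy` and `filled`, shows this behaviour.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column
toy = pd.DataFrame({
    "Rating": ["Low", np.nan, "Low", "Premium"],
    "Age": [34.0, np.nan, 71.0, 36.0],
})

# Impute with the mode (categorical) and the mean (numerical)
filled = toy.fillna({"Rating": toy["Rating"].mode()[0],
                     "Age": toy["Age"].mean()})

print(toy.isnull().sum().sum())     # the original still has missing values
print(filled.isnull().sum().sum())  # the returned copy has none
```

This is consistent with the cells that follow, where `df` still contains the missing values to be imputed with scikit-learn.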


It is of course possible to impute the missing values using scikit-learn.

The [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) object handles missing values in both numerical and categorical data.

In [77]:
# Definition of the array of a variable with missing values
age = df[["Age"]].values

In [78]:
# Instance of the imputer object
si = SimpleImputer()
si

In [79]:
# Fit and transform of the imputer to the data
age_imputed = si.fit_transform(age)
age_imputed[:10, :]

array([[34.        ],
       [82.        ],
       [71.        ],
       [36.        ],
       [68.        ],
       [77.        ],
       [37.        ],
       [55.88947368],
       [66.        ],
       [55.88947368]])

Using the attribute `statistics_` it is possible to see the value used to impute the missing values.

In [80]:
si.statistics_

array([55.88947368])

What is left is the imputation of the missing values in the Rating column.

We can impute the missing values in categorical columns by setting, for instance, `strategy="most_frequent"`.

In [81]:
rating = df[["Rating"]].values
si = SimpleImputer(strategy="most_frequent")
rating_imputed = si.fit_transform(rating)
rating_imputed[:10, :]

array([['Low'],
       ['Premium'],
       ['Premium'],
       ['Premium'],
       ['Medium'],
       ['Premium'],
       ['Low'],
       ['Premium'],
       ['Low'],
       ['Premium']], dtype=object)

With these strategies we can now proceed to numerical analysis, plots, etc.

---

#### Exercise 
Read the `heterogeneous_data.csv` file in the datasets folder, introduce missing values in two columns (one numerical and one categorical) and make all the transformations required using one column transformer.

After that, study the dataset with the techniques we have explored in the [Exploratory Data Analysis lecture]().