## Recap of Pandas

Let's quickly review some important Pandas concepts!

In [1]:
import pandas as pd

A company holds information about its clients and would like to run some business analytics on it. Let's have a look at the data!

In [2]:
company = pd.DataFrame({
    'Name':['Albert Einstein', 'Katniss Everdeen', 'Jeff Dean', 'Steve Jobs', 
            'Nelson Mandela', 'Karim Benzema', 'Angela Merkel', 'Bad Bunny'],
    'Age':[None, 17, 52, None, None, 33, 66, 27],
    'Gender':['Male', 'Female', 'Male', 'Male', 'Male', 'Male', "Female", None]})
company

,Name,Age,Gender
0,Albert Einstein,,Male
1,Katniss Everdeen,17.0,Female
2,Jeff Dean,52.0,Male
3,Steve Jobs,,Male
4,Nelson Mandela,,Male
5,Karim Benzema,33.0,Male
6,Angela Merkel,66.0,Female
7,Bad Bunny,27.0,


### Basic data analysis of the company data

Let's use the pandas `.describe()` method to get some basic statistics for the dataset.

In [3]:
company.describe()

,Age
count,5.0
mean,39.0
std,19.761073
min,17.0
25%,27.0
50%,33.0
75%,52.0
max,66.0


The statistics only cover _Age_ because it is the only numeric column. Note also that `NaN` values are excluded from the calculations, which is why `count` is 5 even though the dataset has 8 rows. We will deal with the missing values later.
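To make the handling of `NaN` concrete, here is a minimal sketch that recomputes two of the statistics by hand. The standalone `ages` Series is an assumption made for illustration; it simply repeats the _Age_ values from the `company` DataFrame.

```python
import pandas as pd

# Same ages as in the `company` DataFrame; None is stored as NaN.
ages = pd.Series([None, 17, 52, None, None, 33, 66, 27], name="Age")

n_total = len(ages)        # all rows, including the missing ones
n_present = ages.count()   # count() skips NaN, like the `count` row of describe()
mean_age = ages.mean()     # mean() also skips NaN by default (skipna=True)

print(n_total, n_present, mean_age)
```

The mean is therefore taken over the five known ages only, not over all eight clients.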

Now, let's find the most common _Gender_ in the dataset. You can use the `.mode()` method for this.

In [4]:
company["Gender"].mode().item()

'Male'
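`.mode()` returns a Series because several values can be tied for most common, and `.item()` raises a `ValueError` when that Series has more than one element. The sketch below shows the counts behind the result and a tie-tolerant alternative. The standalone `gender` and `tied` Series are assumptions made for illustration.

```python
import pandas as pd

# Same values as the Gender column of `company`.
gender = pd.Series(
    ["Male", "Female", "Male", "Male", "Male", "Male", "Female", None],
    name="Gender",
)

# value_counts() ignores missing values by default (dropna=True).
counts = gender.value_counts()
print(counts)

# .iloc[0] picks the first mode and still works when there are several.
most_common = gender.mode().iloc[0]
print(most_common)

# A hypothetical tie: both values are returned as modes.
tied = pd.Series(["A", "B"])
print(tied.mode().tolist())
```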

### Missing values in the company data

Some of the company's information is missing. Let's identify the rows that have at least one missing value. For this, we combine `isnull()` with `any(axis=1)`, which yields one boolean per row.

In [5]:
company[company.isnull().any(axis=1)]

,Name,Age,Gender
0,Albert Einstein,,Male
3,Steve Jobs,,Male
4,Nelson Mandela,,Male
7,Bad Bunny,27.0,
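Before choosing how to handle the gaps, it can help to see where they are. The following sketch rebuilds the same DataFrame and counts missing values per column and per row. The variable names `missing_per_column` and `rows_with_missing` are chosen here for illustration.

```python
import pandas as pd

company = pd.DataFrame({
    "Name": ["Albert Einstein", "Katniss Everdeen", "Jeff Dean", "Steve Jobs",
             "Nelson Mandela", "Karim Benzema", "Angela Merkel", "Bad Bunny"],
    "Age": [None, 17, 52, None, None, 33, 66, 27],
    "Gender": ["Male", "Female", "Male", "Male", "Male", "Male", "Female", None],
})

# isnull() gives a boolean DataFrame; sum() counts the True values per column.
missing_per_column = company.isnull().sum()
print(missing_per_column)

# any(axis=1) collapses each row to a single boolean:
# True if at least one value in that row is missing.
rows_with_missing = company.isnull().any(axis=1)
print(company[rows_with_missing])
```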


Now that we have identified the missing values, we can decide on a policy for handling them. Let's fill the missing _Age_ values with 0s and drop the rows with an unknown _Gender_.

In [6]:
company.fillna({'Age': 0}, inplace=True)
company.dropna(inplace=True)
company

,Name,Age,Gender
0,Albert Einstein,0.0,Male
1,Katniss Everdeen,17.0,Female
2,Jeff Dean,52.0,Male
3,Steve Jobs,0.0,Male
4,Nelson Mandela,0.0,Male
5,Karim Benzema,33.0,Male
6,Angela Merkel,66.0,Female
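The same policy can be written without `inplace=True`, which leaves the original data untouched and makes it easy to compare before and after. This is only a sketch: the name `cleaned` and the use of `subset=["Gender"]` are choices made here, not part of the cell above.

```python
import pandas as pd

company = pd.DataFrame({
    "Name": ["Albert Einstein", "Katniss Everdeen", "Jeff Dean", "Steve Jobs",
             "Nelson Mandela", "Karim Benzema", "Angela Merkel", "Bad Bunny"],
    "Age": [None, 17, 52, None, None, 33, 66, 27],
    "Gender": ["Male", "Female", "Male", "Male", "Male", "Male", "Female", None],
})

# Same policy, but each step returns a new DataFrame.
cleaned = (
    company
    .fillna({"Age": 0})          # replace missing ages with 0
    .dropna(subset=["Gender"])   # drop only rows where Gender is missing
)

print(len(company), len(cleaned))   # number of rows before and after
print(company["Age"].mean())        # mean over the known ages only
print(cleaned["Age"].mean())        # the filled-in zeros pull the mean down
```

Comparing the two means shows that filling with 0 is a modelling decision with consequences: the zeros are treated as real ages in every later statistic.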


### Convert categorical variables into dummy/indicator variables

Most machine learning algorithms expect numeric input and cannot handle categorical variables directly. Such variables are usually transformed into binary variables that indicate the presence or absence of each category. We will transform the _Gender_ attribute from a categorical variable into dummy variables by applying the `pd.get_dummies()` function.

In [7]:
pd.get_dummies(company, columns=["Gender"])

,Name,Age,Gender_Female,Gender_Male
0,Albert Einstein,0.0,0,1
1,Katniss Everdeen,17.0,1,0
2,Jeff Dean,52.0,0,1
3,Steve Jobs,0.0,0,1
4,Nelson Mandela,0.0,0,1
5,Karim Benzema,33.0,0,1
6,Angela Merkel,66.0,1,0
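Two optional parameters of `pd.get_dummies()` are worth knowing. The sketch below uses a small hypothetical `people` DataFrame (a subset of the clients) to keep the output short. Passing `dtype=int` requests 0/1 columns explicitly, since recent pandas versions may return booleans by default, and `drop_first=True` keeps one column fewer than the number of categories.

```python
import pandas as pd

# Hypothetical three-row subset of the company data.
people = pd.DataFrame({
    "Name": ["Katniss Everdeen", "Jeff Dean", "Angela Merkel"],
    "Gender": ["Female", "Male", "Female"],
})

# One indicator column per category, stored as integers.
dummies = pd.get_dummies(people, columns=["Gender"], dtype=int)
print(dummies)

# drop_first=True drops the first category (here "Female"),
# because it is implied when all remaining indicators are 0.
compact = pd.get_dummies(people, columns=["Gender"], dtype=int, drop_first=True)
print(compact)
```

With only two categories, the single remaining column `Gender_Male` carries the same information as the pair of columns.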
