# PyCon2019: Hello World of Machine Learning using Scikit-learn


## [9] - Managing Real World M.L. Data with Pandas

Data science and machine learning feed on data, but we don't (and can't) dump raw data straight into ML models. The data first has to be cleaned and pruned (remember the 80/20 rule). Pandas helps with:

- Pruning and cleaning raw data
- Generating data in the form required by ML models
- Interrogating data to extract valuable information

In [1]:
import pandas as pd
import numpy as np

#### Pandas Data Structures

- Pandas DataFrame
- Pandas Series

### The Pandas DataFrame

A 2-D labeled data structure with columns of potentially different types.

_In other words, a Pandas DataFrame is just like an Excel sheet with rows and columns of data._

![alt text](exceldata.png "DataFrame")

#### A Pandas DataFrame can be created from

- csv
- excel
- pickle
- clipboard
- JSON
- HTML
- HDF5 (HDFStore, via PyTables)
- SAS
- SQL
- Google Bigquery
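As a minimal sketch of how a few of these sources look in code, the snippet below builds the same small table from CSV text, JSON text, and a Python dict. The inline strings are hypothetical stand-ins for real files; `io.StringIO` simply lets `pd.read_csv` and `pd.read_json` read from memory.

```python
import io
import pandas as pd

# Hypothetical inline data standing in for files on disk.
csv_text = "name,age\na,20\nb,27\n"
json_text = '[{"name": "a", "age": 20}, {"name": "b", "age": 27}]'

df_from_csv = pd.read_csv(io.StringIO(csv_text))      # CSV source
df_from_json = pd.read_json(io.StringIO(json_text))   # JSON source (list of records)
df_from_dict = pd.DataFrame({"name": ["a", "b"], "age": [20, 27]})  # in-memory dict

print(df_from_csv)
print(df_from_json)
print(df_from_dict)
```

The other readers (`pd.read_excel`, `pd.read_pickle`, `pd.read_sql`, and so on) follow the same `pd.read_*` naming pattern, though several of them need extra dependencies installed.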

### Creating DataFrame using CSV files

Let's create a CSV first

In [2]:
my_dict = {'name': ["a", "b", "c", "d", "e", "f", None],
           'age': [20, 27, 35, None, 18, None, 35],
           'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
           'designation': ["VP", "CEO", None, "VP", "VP", "CEO", "MD"]}

In [3]:
df_csv = pd.DataFrame(my_dict)

In [4]:
df_csv

Unnamed: 0,name,age,gender,designation
0,a,20.0,Male,VP
1,b,27.0,Female,CEO
2,c,35.0,Male,
3,d,,Female,VP
4,e,18.0,Male,VP
5,f,,Female,CEO
6,,35.0,Male,MD


In [5]:
df_csv.to_csv('demo.csv', index=False)

___Reading from CSV___

In [6]:
df = pd.read_csv('demo.csv')

In [7]:
df

Unnamed: 0,name,age,gender,designation
0,a,20.0,Male,VP
1,b,27.0,Female,CEO
2,c,35.0,Male,
3,d,,Female,VP
4,e,18.0,Male,VP
5,f,,Female,CEO
6,,35.0,Male,MD
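The `index=False` argument used when writing the file matters for the round trip. The sketch below, using a small hypothetical frame and in-memory text instead of a file, shows what happens with and without it: when the index is written, `read_csv` reads it back as an extra column named `Unnamed: 0`.

```python
import io
import pandas as pd

frame = pd.DataFrame({"name": ["a", "b"], "age": [20, 27]})

# to_csv() with no path returns the CSV text instead of writing a file.
with_index = frame.to_csv()                # index written as an unnamed first column
without_index = frame.to_csv(index=False)  # only the real columns are written

cols_with = list(pd.read_csv(io.StringIO(with_index)).columns)
cols_without = list(pd.read_csv(io.StringIO(without_index)).columns)

print(cols_with)     # ['Unnamed: 0', 'name', 'age']
print(cols_without)  # ['name', 'age']
```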


#### Inspect the Data

In [8]:
df.head()

Unnamed: 0,name,age,gender,designation
0,a,20.0,Male,VP
1,b,27.0,Female,CEO
2,c,35.0,Male,
3,d,,Female,VP
4,e,18.0,Male,VP


In [9]:
df.tail()

Unnamed: 0,name,age,gender,designation
2,c,35.0,Male,
3,d,,Female,VP
4,e,18.0,Male,VP
5,f,,Female,CEO
6,,35.0,Male,MD


In [10]:
df.shape

(7, 4)

In [11]:
df.age.sum()

135.0

In [12]:
df['age'].sum()

135.0

In [13]:
df.age.mean()

27.0

In [14]:
df.age.dtype

dtype('float64')
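Note that the sum and mean above were computed over the five valid ages only: Pandas reductions skip missing values by default. The sketch below rebuilds the age column on its own (the variable names are illustrative) to show the `skipna` switch and `count()`.

```python
import numpy as np
import pandas as pd

# Same ages as the demo data; None becomes NaN in a float64 Series.
ages = pd.Series([20, 27, 35, None, 18, None, 35], dtype="float64")

total = ages.sum()                     # NaN entries are skipped by default
mean_valid = ages.mean()               # averaged over the non-missing values
mean_strict = ages.mean(skipna=False)  # NaN as soon as any value is missing
count_valid = ages.count()             # number of non-missing entries

print(total, mean_valid, mean_strict, count_valid)
```

The missing values are also why the column has dtype `float64` even though every age is a whole number: NaN is a floating-point value.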

___A single column of data is called a Pandas Series; for most practical purposes it behaves like a one-dimensional NumPy array___

In [15]:
data = df['age']

In [16]:
data.dtype, data.shape, data.ndim, type(data)

(dtype('float64'), (7,), 1, pandas.core.series.Series)
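A short sketch of that similarity, and of the one thing a Series adds on top of an array (an index of labels). The data and labels here are made up for illustration.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, 27.0, 35.0], index=["a", "b", "c"], name="age")

arr = ages.to_numpy()   # the underlying values as a NumPy array
doubled = ages * 2      # vectorised arithmetic, just like NumPy
age_of_b = ages["b"]    # label-based lookup, which a plain array lacks

print(type(arr), doubled.tolist(), age_of_b)
```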

___Getting Valid Data___

In [17]:
df.isnull()

Unnamed: 0,name,age,gender,designation
0,False,False,False,False
1,False,False,False,False
2,False,False,False,True
3,False,True,False,False
4,False,False,False,False
5,False,True,False,False
6,True,False,False,False


In [18]:
df.isnull().sum()

name           1
age            2
gender         0
designation    1
dtype: int64

#### iloc, and loc

- iloc : Selecting data by integer position (row and column numbers)
- loc : Selecting data by label or by a boolean condition
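With the default 0, 1, 2, ... index, positions and labels coincide, which can hide the difference between the two. The sketch below uses a hypothetical frame with non-default index labels to make the distinction visible.

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ["a", "b", "c"], "age": [20, 27, 35]},
    index=[10, 20, 30],  # hypothetical non-default labels
)

by_position = people.iloc[0]     # first row, whatever its label
by_label = people.loc[10]        # row whose index label is 10
pos_slice = people.iloc[0:2]     # end-exclusive: positions 0 and 1
label_slice = people.loc[10:20]  # end-inclusive: labels 10 and 20
over_25 = people.loc[people["age"] > 25, "name"]  # boolean mask plus column label

print(over_25.tolist())
```

Note the slicing asymmetry: `iloc` follows Python's end-exclusive convention, while `loc` includes the end label.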

In [19]:
df.iloc[2]

name              c
age              35
gender         Male
designation     NaN
Name: 2, dtype: object

In [20]:
df.iloc[0:4]

Unnamed: 0,name,age,gender,designation
0,a,20.0,Male,VP
1,b,27.0,Female,CEO
2,c,35.0,Male,
3,d,,Female,VP


In [21]:
df.iloc[0:4, 0:2]

Unnamed: 0,name,age
0,a,20.0
1,b,27.0
2,c,35.0
3,d,


In [22]:
df.loc[df['age'] > 30]

Unnamed: 0,name,age,gender,designation
2,c,35.0,Male,
6,,35.0,Male,MD


In [23]:
df.loc[df['age'].isnull()]

Unnamed: 0,name,age,gender,designation
3,d,,Female,VP
5,f,,Female,CEO


In [24]:
df

Unnamed: 0,name,age,gender,designation
0,a,20.0,Male,VP
1,b,27.0,Female,CEO
2,c,35.0,Male,
3,d,,Female,VP
4,e,18.0,Male,VP
5,f,,Female,CEO
6,,35.0,Male,MD


___Handling Missing Data___

_Deleting rows with missing values is rarely a great idea, because the valid data in the other columns of those rows is lost along with them._

_Instead, we can fill in (impute) the missing values, for example with the mean of the column._

In [25]:
df.loc[df['age'].isnull(), 'age'] = df.age.mean()

In [26]:
df

Unnamed: 0,name,age,gender,designation
0,a,20.0,Male,VP
1,b,27.0,Female,CEO
2,c,35.0,Male,
3,d,27.0,Female,VP
4,e,18.0,Male,VP
5,f,27.0,Female,CEO
6,,35.0,Male,MD
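The same mean imputation can also be written with `fillna`. The sketch below, on a small hypothetical frame, contrasts it with `dropna` to show what deleting rows would cost.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "age": [20.0, None, 40.0, None],
})

# Dropping rows with a missing age loses half of this table.
dropped = people.dropna(subset=["age"])

# Filling with the column mean keeps every row.
filled = people.fillna({"age": people["age"].mean()})

print(len(dropped), filled["age"].tolist())
```

Mean imputation is only one choice; the median or a per-group mean may suit skewed data better, depending on the problem.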


#### Convert categorical variables into dummy/indicator variables

__Most machine learning algorithms take numerical values rather than text/string values. We can create binary (0/1) indicator columns, one per category, using the function__

- pd.get_dummies()

In [27]:
df = pd.get_dummies(df, columns=['gender'])

In [28]:
df

Unnamed: 0,name,age,designation,gender_Female,gender_Male
0,a,20.0,VP,0,1
1,b,27.0,CEO,1,0
2,c,35.0,,0,1
3,d,27.0,VP,1,0
4,e,18.0,VP,0,1
5,f,27.0,CEO,1,0
6,,35.0,MD,0,1


In [29]:
df = pd.get_dummies(df, columns=['designation'])

In [30]:
df

Unnamed: 0,name,age,gender_Female,gender_Male,designation_CEO,designation_MD,designation_VP
0,a,20.0,0,1,0,0,1
1,b,27.0,1,0,1,0,0
2,c,35.0,0,1,0,0,0
3,d,27.0,1,0,0,0,1
4,e,18.0,0,1,0,0,1
5,f,27.0,1,0,1,0,0
6,,35.0,0,1,0,1,0
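Two details of the output above are worth a sketch. First, the row with a missing designation gets zeros in every indicator column; `dummy_na=True` adds an explicit column for it. Second, recent Pandas versions return boolean indicator columns by default, so `dtype=int` is passed here to get the 0/1 values shown in this notebook. The `staff` frame is a hypothetical cut-down version of the demo data.

```python
import pandas as pd

staff = pd.DataFrame({"designation": ["VP", "CEO", None, "VP"]})

# dtype=int requests 0/1 columns instead of True/False.
encoded = pd.get_dummies(staff, columns=["designation"], dtype=int)

# dummy_na=True appends an indicator column for missing values.
encoded_na = pd.get_dummies(
    staff, columns=["designation"], dtype=int, dummy_na=True
)

print(encoded)
print(encoded_na)
```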


#### What about Name?

_How can we convert name to numeric values?_

_We don't need to: a name is just an identifier and hardly has any impact on what the model learns, so we can drop the column._

#### Dropping unnecessary columns

In [31]:
df.drop(['name'], axis = 1)

Unnamed: 0,age,gender_Female,gender_Male,designation_CEO,designation_MD,designation_VP
0,20.0,0,1,0,0,1
1,27.0,1,0,1,0,0
2,35.0,0,1,0,0,0
3,27.0,1,0,0,0,1
4,18.0,0,1,0,0,1
5,27.0,1,0,1,0,0
6,35.0,0,1,0,1,0
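Keep in mind that `drop` returns a new DataFrame and leaves the original untouched, which is why the `name` column is still present in later cells of this notebook. A minimal sketch with a hypothetical two-column frame:

```python
import pandas as pd

frame = pd.DataFrame({"name": ["a", "b"], "age": [20.0, 27.0]})

frame.drop(columns=["name"])          # result discarded, frame is unchanged
still_there = "name" in frame.columns

frame = frame.drop(columns=["name"])  # reassign to keep the result
gone = "name" not in frame.columns

print(still_there, gone)
```

`drop(columns=[...])` is equivalent to `drop([...], axis=1)` and tends to read more clearly.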


### Adding a new column to the DataFrame

_If required, we can add an additional column to the DataFrame using a NumPy array (or a Pandas Series)._

In [32]:
np_arr = np.array([1,2,3,4,5,6,7], dtype=np.int8)
df['new_col'] = np_arr

In [33]:
df

Unnamed: 0,name,age,gender_Female,gender_Male,designation_CEO,designation_MD,designation_VP,new_col
0,a,20.0,0,1,0,0,1,1
1,b,27.0,1,0,1,0,0,2
2,c,35.0,0,1,0,0,0,3
3,d,27.0,1,0,0,0,1,4
4,e,18.0,0,1,0,0,1,5
5,f,27.0,1,0,1,0,0,6
6,,35.0,0,1,0,1,0,7
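The two options behave differently in one respect: a NumPy array is assigned by position and must match the number of rows, whereas a Series is aligned on its index labels, and rows without a matching label get NaN. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"age": [20.0, 27.0, 35.0]})

# Assigned by position: element i goes to row i.
frame["from_array"] = np.array([1, 2, 3])

# Aligned on index labels: label 2 gets 10, label 0 gets 20, label 1 gets NaN.
frame["from_series"] = pd.Series([10, 20], index=[2, 0])

print(frame)
```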


#### Machine Learning is all about Mathematics

Just like NumPy, a Pandas DataFrame supports mathematical operations, applied element by element. Let's create a DataFrame from a numeric NumPy array.

In [34]:
np_arr = np.array([[1,2,3,4],
                   [5,6,7,8],
                   [9,10,11,12],
                   [13,15,16,16],
                   [17,18,19,20]])

df = pd.DataFrame(np_arr, columns=['first', 'second', 'third', 'fourth'])

In [35]:
df

Unnamed: 0,first,second,third,fourth
0,1,2,3,4
1,5,6,7,8
2,9,10,11,12
3,13,15,16,16
4,17,18,19,20


In [36]:
df * df

Unnamed: 0,first,second,third,fourth
0,1,4,9,16
1,25,36,49,64
2,81,100,121,144
3,169,225,256,256
4,289,324,361,400


In [37]:
df * 100

Unnamed: 0,first,second,third,fourth
0,100,200,300,400
1,500,600,700,800
2,900,1000,1100,1200
3,1300,1500,1600,1600
4,1700,1800,1900,2000


In [38]:
df & 20

Unnamed: 0,first,second,third,fourth
0,0,0,0,4
1,4,4,4,0
2,0,0,0,4
3,4,4,16,16
4,16,16,16,20
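All of the operations above work element by element: `df * df` squares each entry rather than performing a matrix product, and `df & 20` is a bitwise AND of each entry with 20. The sketch below repeats the idea on a hypothetical 2x2 frame and shows how to get a true matrix product or per-column statistics instead.

```python
import numpy as np
import pandas as pd

nums = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["first", "second"])

squared = nums * nums                       # element-wise product
matmul = nums.to_numpy() @ nums.to_numpy()  # true matrix product via NumPy
masked = nums & 2                           # bitwise AND of every element with 2
col_means = nums.mean()                     # one value per column

print(squared.to_numpy().tolist())  # [[1, 4], [9, 16]]
print(matmul.tolist())              # [[7, 10], [15, 22]]
print(masked.to_numpy().tolist())   # [[0, 2], [2, 0]]
print(col_means.tolist())           # [2.0, 3.0]
```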
