# Kickstarting Python with Jupyter Notebooks
### Tuesday, July 14 9:00 AM PDT

## Topics for Discussion

<ol>
<li>User Defined Functions</li>
<li>Strings</li>
<li>Lists</li>
<li>Looping Constructs</li>
<li>Exception Handling</li>
<li>Built-in Modules</li>
</ol>

# Installation  

In [1]:
!pip install pandas



In [2]:
import pandas as pd

# Core components of pandas: Series and DataFrames

##### A Series is essentially a single column, and a DataFrame is a two-dimensional table made up of a collection of Series.
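To make the relationship concrete, here is a minimal sketch that builds a DataFrame from two Series and then pulls one column back out. The names `names`, `scores`, and `frame`, and the values in them, are made up for illustration and are not part of the dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
names = pd.Series(["Ann", "Bob", "Cy"], name="name")
scores = pd.Series([90, 85, 77], name="score")

# Combine the two Series side by side; each Series becomes one column
frame = pd.concat([names, scores], axis=1)

# Selecting a single column hands back a Series again
column = frame["score"]

print(type(frame).__name__)    # DataFrame
print(type(column).__name__)   # Series
print(frame.shape)             # (3, 2)
```

Each Series keeps its `name`, which becomes the column label in the resulting DataFrame.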

![data-frames-in-python-banner_cgzjxy.jpeg](attachment:data-frames-in-python-banner_cgzjxy.jpeg)

### Creation of the dataset

In [42]:
dataset = {
    'col1': [1, 7, 1, 21, 1, 1], 
    'col2': [0, 6, 0, 0, 5, 5]
}

In [43]:
dataframe = pd.DataFrame(dataset)

dataframe

Unnamed: 0,col1,col2
0,1,0
1,7,6
2,1,0
3,21,0
4,1,5
5,1,5


In [12]:
dataframe = pd.DataFrame(dataset, index=['row1', 'row2', 'row3', 'row4', 'row5', 'row6'])

dataframe

Unnamed: 0,col1,col2
row1,1,0
row2,7,6
row3,1,0
row4,21,0
row5,1,5
row6,1,5


### Locating a specific row

In [13]:
dataframe.loc['row2']

col1    7
col2    6
Name: row2, dtype: int64
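As a small extension of the lookup above, the sketch below contrasts label-based access (`loc`) with position-based access (`iloc`). It rebuilds the same dataset so it can run on its own; the variable names `df`, `by_label`, `by_position`, `single_value`, and `label_slice` are illustrative.

```python
import pandas as pd

dataset = {
    'col1': [1, 7, 1, 21, 1, 1],
    'col2': [0, 6, 0, 0, 5, 5]
}
labels = ['row1', 'row2', 'row3', 'row4', 'row5', 'row6']
df = pd.DataFrame(dataset, index=labels)

# loc looks rows up by their index label
by_label = df.loc['row2']

# iloc looks rows up by integer position (0-based), so position 1 is 'row2'
by_position = df.iloc[1]

# loc also accepts a (row label, column label) pair to fetch one cell
single_value = df.loc['row4', 'col1']

# Label slices with loc include both endpoints
label_slice = df.loc['row2':'row4']

print(by_label.equals(by_position))  # True
print(single_value)                  # 21
print(len(label_slice))              # 3
```

Note that a label slice includes its end label, unlike ordinary Python slicing.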

### Viewing your data

In [15]:
dataframe.head()  # display the first 5 rows (the default)

Unnamed: 0,col1,col2
row1,1,0
row2,7,6
row3,1,0
row4,21,0
row5,1,5


In [16]:
dataframe.head(6)

Unnamed: 0,col1,col2
row1,1,0
row2,7,6
row3,1,0
row4,21,0
row5,1,5
row6,1,5


In [17]:
dataframe.tail()

Unnamed: 0,col1,col2
row2,7,6
row3,1,0
row4,21,0
row5,1,5
row6,1,5


In [18]:
dataframe.tail(6)

Unnamed: 0,col1,col2
row1,1,0
row2,7,6
row3,1,0
row4,21,0
row5,1,5
row6,1,5


### Information about the dataframe

In [19]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, row1 to row6
Data columns (total 2 columns):
col1    6 non-null int64
col2    6 non-null int64
dtypes: int64(2)
memory usage: 304.0+ bytes


In [20]:
dataframe.shape

(6, 2)

In [49]:
dataframe.describe()  # summary statistics for each numeric column

Unnamed: 0,col1,col2
count,6.0,6.0
mean,5.333333,2.666667
std,8.041559,2.94392
min,1.0,0.0
25%,1.0,0.0
50%,1.0,2.5
75%,5.5,5.0
max,21.0,6.0


### Handling duplicate values

In [21]:
test_df = pd.concat([dataframe, dataframe])  # doubling up the existing data (DataFrame.append was removed in pandas 2.0)

test_df.shape

(12, 2)

In [22]:
test_df = test_df.drop_duplicates()

test_df.shape

(4, 2)
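The shape drops to 4 rows rather than 6 because the original data already contained repeated rows. The sketch below, which assumes the same dataset with the default integer index, shows how to inspect duplicates before dropping them and how the `keep` and `subset` parameters change the result. The variable names are illustrative.

```python
import pandas as pd

dataset = {
    'col1': [1, 7, 1, 21, 1, 1],
    'col2': [0, 6, 0, 0, 5, 5]
}
df = pd.DataFrame(dataset)

# duplicated() marks every row that repeats an earlier row
dup_mask = df.duplicated()
print(dup_mask.sum())            # 2

# Default behaviour: keep the first occurrence of each distinct row
first_kept = df.drop_duplicates()
print(first_kept.shape)          # (4, 2)

# keep=False discards every row that has a duplicate anywhere
no_repeats = df.drop_duplicates(keep=False)
print(no_repeats.shape)          # (2, 2)

# subset restricts the comparison to the listed columns
by_col1 = df.drop_duplicates(subset=['col1'])
print(by_col1.shape)             # (3, 2)
```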

### Getting columns & indexes

In [24]:
dataframe.columns

Index(['col1', 'col2'], dtype='object')

In [26]:
dataframe.index

Index(['row1', 'row2', 'row3', 'row4', 'row5', 'row6'], dtype='object')

## How to work with NULL values

1. Remove rows or columns with null values
2. Replace null values with non-null values (imputation)

In [29]:
dataframe.isnull()

Unnamed: 0,col1,col2
row1,False,False
row2,False,False
row3,False,False
row4,False,False
row5,False,False
row6,False,False


In [30]:
dataframe.isnull().sum()

col1    0
col2    0
dtype: int64

In [40]:
import numpy as np

# adding some null values artificially
dataframe["col3"] = np.nan

In [37]:
dataframe

Unnamed: 0,col1,col2,col3
row1,1,0,
row2,7,6,
row3,1,0,
row4,21,0,
row5,1,5,
row6,1,5,


In [38]:
dataframe.isnull()

Unnamed: 0,col1,col2,col3
row1,False,False,True
row2,False,False,True
row3,False,False,True
row4,False,False,True
row5,False,False,True
row6,False,False,True


In [39]:
dataframe.isnull().sum()

col1    0
col2    0
col3    6
dtype: int64

In [41]:
dataframe.dropna()  # drops every row, because each row has a null in col3

Unnamed: 0,col1,col2,col3


In [44]:
dataframe.dropna(axis=1)  # drops only the columns that contain nulls

Unnamed: 0,col1,col2
0,1,0
1,7,6
2,1,0
3,21,0
4,1,5
5,1,5


### What does the axis=1 parameter signify?

In [46]:
dataframe.shape

(6, 2)

##### In the shape tuple above, rows are at index 0 and columns are at index 1, so axis=0 refers to rows and axis=1 refers to columns.
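A short sketch can make the axis convention easier to see. It uses a trimmed, three-row version of the dataset (an assumption made to keep the output small) and applies both `sum` and `dropna` along each axis.

```python
import numpy as np
import pandas as pd

# Trimmed example data, for illustration only
df = pd.DataFrame({'col1': [1, 7, 1], 'col2': [0, 6, 0]})

# axis=0 works down the rows, giving one result per column
column_totals = df.sum(axis=0)
print(column_totals.tolist())    # [9, 6]

# axis=1 works across the columns, giving one result per row
row_totals = df.sum(axis=1)
print(row_totals.tolist())       # [1, 13, 1]

# Add an all-null column, then drop along each axis
df['col3'] = np.nan
rows_dropped = df.dropna(axis=0)   # removes rows containing a null
cols_dropped = df.dropna(axis=1)   # removes columns containing a null
print(rows_dropped.shape)        # (0, 3)
print(cols_dropped.shape)        # (3, 2)
```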

## Now let's discuss imputation

#### Imputation is a conventional feature engineering technique for keeping valuable rows that contain null values.

There may be instances where dropping every row with a null value removes too big a chunk of your dataset,
so instead we can impute each null, that is, replace it with another value,
usually the mean or the median of that column.
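Here is a minimal sketch of mean and median imputation using `fillna`. The column values are a hypothetical variant of `col1` with two entries replaced by nulls; they are not taken from the notebook's outputs.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values
df = pd.DataFrame({'col1': [1.0, np.nan, 1.0, 21.0, np.nan, 1.0]})

# mean() and median() skip nulls by default
col_mean = df['col1'].mean()       # (1 + 1 + 21 + 1) / 4
col_median = df['col1'].median()   # middle of 1, 1, 1, 21

# fillna returns a new Series with the nulls replaced
mean_filled = df['col1'].fillna(col_mean)
median_filled = df['col1'].fillna(col_median)

print(col_mean)                        # 6.0
print(col_median)                      # 1.0
print(mean_filled.isnull().sum())      # 0
```

With an outlier such as 21 in the column, the mean and the median give noticeably different fill values, which is one reason the median is often preferred for skewed data.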

![1_MiJ_HpTbZECYjjF1qepNNQ.png](attachment:1_MiJ_HpTbZECYjjF1qepNNQ.png)

Ref: https://www.kaggle.com/residentmario/simple-techniques-for-missing-data-imputation