# Python and Pandas Data Frames

"*Data! Data! Data! I can't make bricks without clay.*" - Sherlock Holmes

## Python Packages and Built-in Functions

Python has a ton of packages that make doing complicated stuff very easy. We won't discuss how to install packages or give a detailed list of what packages exist, but we will briefly describe how they are used.

An easy way to see why packages are useful is to remember: "**Python packages give us access to MANY functions**".

Packages contain pre-defined functions that make our life easier! We've seen pre-defined functions before: for example, the built-in function `str()` that we used to convert numbers into strings in the Python Basics notebook.

In this class we will use five packages very frequently: `pandas`, `sklearn`, `matplotlib`, `seaborn`, and `numpy`:

- **`pandas`** is a data manipulation package. It lets us store data in data frames. More on this soon.
- **`sklearn`** is a machine learning and data science package. It lets us do fairly complicated machine learning tasks, such as running regressions and building classification models with only a few lines of code. 
- **`matplotlib.pyplot`** lets you make plots and graphs directly from your code.  This can be a secret weapon when combined with notebooks, as you can very easily rerun analyses on different data or with slightly different code, and the graphs can just appear magically.  
- **`seaborn`** is an extension to matplotlib that really helps make your plots look more appealing.
- **`numpy`** (pronounced num-pie) is used for doing "math stuff", such as complex mathematical operations (e.g., square roots, exponents, logs), operations on matrices, and more. 

As we use these through the semester, their usefulness will become increasingly apparent.

To make the contents of a package available, you need to **import** it. Sometimes it is easier to use short names for packages. This has become the norm now, so let's do it so that you recognize it if you encounter it in your work.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

What about the package **Pandas**? 

Pandas gives us the **DATAFRAME** -- one of the main data structures used in data analytics.

A DataFrame is a 2-dimensional "labeled" data structure with columns of potentially different types. It is generally the most commonly used pandas object. Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. It's often convenient to think of it as a spreadsheet with super powers! [More details here](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe)

Pandas data frames can be constructed from most common data sources a data scientist will encounter: CSV files, Excel spreadsheets, SQL databases, JSON, URLs pointing to other data sources, and even other data already stored in one's Python code.
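As a small sketch of two of these cases, the snippet below builds one data frame from a Python dictionary and another from CSV text. The names and values are made up for illustration, and `io.StringIO` stands in for a real file on disk.

```python
import io

import pandas as pd

# Hypothetical toy data, invented for this illustration
data = {"Name": ["Ann", "Bob", "Cy"], "Age": [22, 35, 58]}

# Build a data frame from a dict, passing explicit row labels via `index`
df_from_dict = pd.DataFrame(data, index=[101, 102, 103])

# Build a similar data frame from CSV text held in memory
csv_text = "PassengerId,Name,Age\n101,Ann,22\n102,Bob,35\n103,Cy,58\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

print(df_from_dict)
print(df_from_csv)
```

In both cases the row labels (101, 102, 103) become the index rather than an ordinary column.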

Let's take a look at a data frame that could be familiar to you.


In [None]:
# load datasets
train = pd.read_csv('train.csv',index_col='PassengerId')

In [None]:
# view first 5 rows in train data frame
train.head()

### Data Exploration
We now have the data loaded in a pandas data frame. As a starter, let's see some of the (MANY!) ways pandas makes it convenient to explore a dataset. Before we start our analysis, it is always good practice to look at our data first.

In [None]:
# get data frame info
train.info()

In [None]:
# Some general stats about the data
train.describe()

In [None]:
# description for a categorical variable
train['Embarked'].describe()

In [None]:
# What is the distribution of Survived?
train["Survived"].value_counts()
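Counts are often easier to compare as proportions. A minimal sketch on a made-up series (the values below are not taken from the Titanic data) uses the `normalize=True` option of `value_counts`:

```python
import pandas as pd

# Hypothetical toy column: 1 = survived, 0 = did not survive
survived = pd.Series([0, 1, 0, 0], name="Survived")

counts = survived.value_counts()                # absolute counts per value
shares = survived.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```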

In [None]:
# Columns can be selected using the `[]` operator, which accepts one column name or a list of several
train[["Sex", "Survived"]].head(5)

For selecting rows from the data there are two options:
- `.loc`: for selecting rows based on the _row label_
- `.iloc`: for selecting rows based on the _row number_
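The difference matters whenever the row labels are not simply 0, 1, 2, and so on. The sketch below uses a made-up frame whose labels start at 1 (as `PassengerId` does) to show that the same number can pick different rows:

```python
import pandas as pd

# Hypothetical toy frame whose row labels start at 1
toy = pd.DataFrame({"Fare": [7.25, 71.28, 7.92, 53.10]}, index=[1, 2, 3, 4])

by_label = toy.loc[2]      # the row whose label is 2
by_position = toy.iloc[2]  # the third row (positions start at 0), label 3

print(by_label["Fare"])
print(by_position["Fare"])
```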

In [None]:
# Returns the row at position 5, i.e. the 6th row (positions start at 0).
train.iloc[5]

In [None]:
# Returns the first 5 rows
train.iloc[:5]

One can also select those rows that match a particular condition. 

In [None]:
# rows/instances that have a fare greater than 100
train[train['Fare'] > 100].head(5)
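Under the hood, the condition produces a boolean Series with one True/False per row, and indexing with it keeps only the True rows. A short sketch on made-up values, which also shows how two conditions might be combined:

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({"Fare": [7.25, 151.55, 263.0, 8.05], "Pclass": [3, 1, 1, 3]})

mask = toy["Fare"] > 100  # boolean Series: one True/False per row
expensive = toy[mask]     # keeps only the rows where mask is True

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
cheap_third = toy[(toy["Fare"] < 10) & (toy["Pclass"] == 3)]

print(mask.tolist())
print(expensive)
print(cheap_third)
```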

#### Data visualization

In [None]:
# boxplot for Age
sns.boxplot(x=train['Age'])

In [None]:
# boxplot for Age, grouped by Survived
sns.boxplot(x = train['Survived'], y = train['Age'])

In [None]:
# histogram for Age
train['Age'].hist()

In [None]:
# Bar chart for Survived
train['Survived'].value_counts().plot(kind='bar')

In [None]:
# scatter plots of Age against Fare: one panel per Pclass, colored by Survived
g = sns.FacetGrid(train, hue='Survived', col='Pclass', margin_titles=True)
g = g.map(plt.scatter, 'Fare', 'Age', edgecolor='w').add_legend()

### Data Preparation - Deal with missing values

In [None]:
# Create a copy of the original data
train_df = train.copy()

In [None]:
# replace missing values in "Age" with the mean age
train_df.fillna({'Age': train['Age'].mean()}, inplace=True)

In [None]:
# check whether missing value in "Age" is replaced or not
train_df['Age'].describe()

In [None]:
# another way to check number of nulls
print(train.Age.isnull().sum())
print(train_df.Age.isnull().sum())
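To see what mean imputation does on a scale small enough to check by hand, here is a sketch with three made-up ages, one of them missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})

mean_age = toy["Age"].mean()            # missing values are skipped by mean()
filled = toy.fillna({"Age": mean_age})  # without inplace=True, a new frame is returned

print(mean_age)
print(toy["Age"].isnull().sum(), filled["Age"].isnull().sum())
```

Because `inplace=True` is not used here, `toy` keeps its missing value and only `filled` is complete.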

In [None]:
# first attempt: replace missing values for "Embarked" in train data with the mode
# (note that mode() returns a Series, not a single value)
train_df.fillna({'Embarked':train['Embarked'].mode()}, inplace= True)

In [None]:
# Check whether the missing values are being replaced
print(train.Embarked.isnull().sum())
print(train_df.Embarked.isnull().sum())

In [None]:
train['Embarked'].mode()

In [None]:
# mode() returns a Series, so take its first element to fill with a single value
train_df.fillna({'Embarked': train['Embarked'].mode()[0]}, inplace=True)
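The reason for the `[0]` is that `mode()` always returns a Series, since a column can have several equally frequent values. A minimal sketch on a made-up column of port codes:

```python
import pandas as pd

# Hypothetical toy column with one missing value
ports = pd.Series(["S", "C", "S", None, "Q"])

modes = ports.mode()    # a Series, even when there is only one mode
most_common = modes[0]  # the first (here: only) most frequent value

filled = ports.fillna(most_common)

print(type(modes).__name__, most_common)
print(filled.tolist())
```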

In [None]:
train[train['Embarked'].isnull()]

In [None]:
# drop the 'Cabin' column from the train data
train_df = train_df.drop(columns='Cabin')

In [None]:
train_df.info()

### Data Preparation - Handle Categorical Variables

In [None]:
# get dummy variables for categorical variables 'Sex' and 'Embarked' in train
train_df = pd.get_dummies(train_df, columns=['Sex', 'Embarked'], drop_first=True)

# this line of code not only creates the dummies and automatically drops the first dummy column of each variable, it also removes the original columns
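A small sketch on made-up data shows the effect: with two categories and `drop_first=True`, a single dummy column remains and the original column disappears.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({"Sex": ["male", "female", "female"], "Age": [22, 38, 26]})

encoded = pd.get_dummies(toy, columns=["Sex"], drop_first=True)

# 'Sex' is gone; 'Sex_female' (first category alphabetically) was dropped
print(encoded.columns.tolist())
print(encoded)
```

Dropping one dummy loses no information: a row with `Sex_male` equal to 0 (or False) must be female.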

In [None]:
train_df

### Data Preparation - Discretization for Continuous Variables

In [None]:
# Binning / Discretization
# give names for different age groups
group_names = ['Young', 'Middle_aged', 'Senior']

# divide Age into 3 equal-width intervals and give them the corresponding names
train_df['Age-binned'] = pd.cut(train_df['Age'], 3, labels=group_names)

# View Age-binned in bar chart
train_df['Age-binned'].value_counts().plot(kind='bar')
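Equal-width bins split the range of the values into intervals of the same length, so the groups can end up with very different sizes. As a point of comparison (not part of the analysis above), `pd.qcut` splits by quantiles so that the groups hold roughly the same number of rows. A sketch on made-up ages:

```python
import pandas as pd

# Hypothetical ages, invented for this illustration
ages = pd.Series([1, 10, 20, 30, 40, 80])
labels = ["Young", "Middle_aged", "Senior"]

equal_width = pd.cut(ages, 3, labels=labels)  # 3 intervals of equal length
equal_freq = pd.qcut(ages, 3, labels=labels)  # 3 groups of (roughly) equal size

print(equal_width.tolist())
print(equal_freq.tolist())
```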

### Data Preparation - Normalization for Continuous Variables

In [None]:
# normalize Fare
# import library 
from sklearn import preprocessing

In [None]:
# Apply min-max normalization on a single attribute
minmax_scaler = preprocessing.MinMaxScaler().fit(train[['Fare']])
train_df['Fare_minmax']=minmax_scaler.transform(train[['Fare']])

# Apply z-score normalization on a single attribute
zscore_scaler = preprocessing.StandardScaler().fit(train[['Fare']])
train_df['Fare_zscore']=zscore_scaler.transform(train[['Fare']])

In [None]:
train_df[['Fare','Fare_minmax','Fare_zscore']]
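To make the two scalers less of a black box, the sketch below applies the same formulas by hand to a few made-up fares: min-max scaling computes `(x - min) / (max - min)`, and z-score scaling computes `(x - mean) / std`. It uses `ddof=0` because `StandardScaler` divides by the population standard deviation.

```python
import pandas as pd

# Hypothetical fares, invented for this illustration
fare = pd.Series([0.0, 10.0, 20.0, 50.0])

# Min-max normalization: results lie between 0 and 1
fare_minmax = (fare - fare.min()) / (fare.max() - fare.min())

# Z-score normalization: results have mean 0 and standard deviation 1
fare_zscore = (fare - fare.mean()) / fare.std(ddof=0)

print(fare_minmax.tolist())
print(fare_zscore.round(3).tolist())
```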

In [None]:
train_df.info()

Pandas is widely used and has a very active development community contributing new features. If there is some kind of analysis you want to do on your data, chances are, it already exists. You can google to find the information you need.

One important component of pandas is indexing and selecting components of the data. This is an extremely rich topic, so here are some examples. Please [consult the documentation](https://pandas.pydata.org/pandas-docs/stable/indexing.html) for more info.