# Intro to pandas

**Learning Objectives:**
  * Gain an introduction to the `DataFrame` and `Series` data structures of the *pandas* library
  * Access and manipulate data within a `DataFrame` and `Series`
  * Import CSV data into a *pandas* `DataFrame`
  * Reindex a `DataFrame` to shuffle data

The primary data structures in *pandas* are implemented as two classes:

  * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.
  * **`Series`**, which is a single column: a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). A `DataFrame` contains one or more `Series` and a name for each `Series`.

## Import NumPy and pandas modules
Run the following code cell to import the NumPy and pandas modules.

In [None]:
import numpy as np
import pandas as pd

One way to create a `Series` is to construct a `Series` object. For example:


In [None]:
pd.Series(['San Francisco', 'San Jose', 'Cleveland'])

0    San Francisco
1         San Jose
2        Cleveland
dtype: object
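To make the "labeled array" idea concrete, here is a minimal sketch of a `Series` with explicit labels instead of the default 0, 1, 2 index. The variable names are illustrative, and the figures are the population values used in the `DataFrame` example in this notebook.

```python
import pandas as pd

# Population figures labeled by city name instead of by position.
population = pd.Series(
    [852469, 1015785, 485199],
    index=['San Francisco', 'San Jose', 'Sacramento'],
    name='population',
)

# Look up a single value by its label.
san_jose = population['San Jose']

# Arithmetic is applied element-wise and keeps the labels.
population_in_thousands = population / 1000.0

print(san_jose)
print(population_in_thousands)
```

Because every value carries its label, results of element-wise operations can still be looked up by city name.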

## Creating a DataFrame

The following code cell creates a simple DataFrame containing 6 cells organized as follows:

  * 3 rows
  * 2 columns, one named `City name` and the other named `Population`

  * The first statement creates a `Series` of 3 strings, and the second creates a `Series` of 3 integers.
  * The third statement creates the `DataFrame` by passing a `dict` that maps `string` column names to their respective `Series`.

In [None]:
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])

pd.DataFrame({ 'City name': city_names, 'Population': population })

       City name  Population
0  San Francisco      852469
1       San Jose     1015785
2     Sacramento      485199
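Once the `DataFrame` exists, its columns can be read and extended by name. The following is a small sketch under that assumption; the `cities` variable and the `Over 1M` column are illustrative names that do not appear elsewhere in this notebook.

```python
import pandas as pd

cities = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# Selecting one column by name returns a Series.
population = cities['Population']
print(type(population))

# Assigning to a new column name adds a column to the DataFrame.
cities['Over 1M'] = cities['Population'] > 1_000_000
print(cities)
```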


## Load CSV file using pandas
* To load data from a CSV file, use the `read_csv()` function, which returns the data as a `DataFrame`.
* The example below uses `DataFrame.describe` to show summary statistics for each feature in the `DataFrame`.
* Another useful function is `DataFrame.head`, which displays the first 5 records of a `DataFrame` by default.

In [None]:
california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
california_housing_dataframe.describe()

In [None]:
california_housing_dataframe.head()
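To try `read_csv`, `head`, and `describe` without a network connection, the sketch below reads a tiny CSV from an in-memory string. The rows are made up for illustration and only borrow a few column names from the housing file.

```python
import io
import pandas as pd

# A tiny made-up CSV with a header row, standing in for the real file.
csv_text = """housing_median_age,total_rooms,population
15,5612,1015
19,7650,1129
17,720,333
14,1501,515
20,1454,624
52,1387,671
"""

# read_csv accepts a path, a URL, or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (5 when n is omitted).
print(sample.head(3))

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
summary = sample.describe()
print(summary)
```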

## Specifying a subset of a DataFrame

pandas provides multiple ways to isolate specific rows, columns, slices, or cells in a `DataFrame`.


In [None]:

# extract a single row (here the third row, at position 2) as a DataFrame; the result is a subset of the original DataFrame
print(california_housing_dataframe.iloc[[2]], '\n')

# extract the second through fourth rows (positions 1, 2, and 3; the end of the slice is excluded)
print(california_housing_dataframe[1:4], '\n')

# extract a single column by its name, 'population'
print("Column 'population':")
print(california_housing_dataframe['population'])
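The difference between position-based and label-based selection is easier to see on a frame whose index labels are not just 0, 1, 2. The sketch below uses a made-up five-row frame; all names in it are illustrative.

```python
import pandas as pd

# Made-up frame with letter labels so that positions and labels differ.
df = pd.DataFrame(
    {'population': [1015, 1129, 333, 515, 624]},
    index=['a', 'b', 'c', 'd', 'e'],
)

# Position-based: rows at positions 1, 2, and 3 (the end position is excluded).
by_position = df.iloc[1:4]

# Label-based: rows 'b' through 'd' (the end label is included).
by_label = df.loc['b':'d']

# Single cell: row label 'c', column 'population'.
cell = df.loc['c', 'population']

# Double brackets keep the result a DataFrame; single brackets give a Series.
one_column_frame = df[['population']]

print(by_position)
print(cell)
```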

## Manipulating data
* The `drop` function removes columns (or rows) that are not needed.
* `Series.apply` accepts a function as an argument, such as a [lambda function](https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions), and applies it to each value.


In [None]:
# remove the 'longitude' and 'latitude' columns
df_dropped_california_housing = california_housing_dataframe.drop(['longitude', 'latitude'], axis=1)
print(df_dropped_california_housing.head())

In [None]:
# select the records whose population is greater than 1000
california_housing_dataframe[california_housing_dataframe['population'].apply(lambda val: val > 1000)]
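For a simple comparison like this one, a lambda is not strictly needed: comparing a `Series` with a number already produces a boolean `Series` that can be used as a filter. The sketch below shows both forms on a made-up frame and checks that they agree.

```python
import pandas as pd

df = pd.DataFrame({'population': [1015, 1129, 333, 515, 2624]})  # made-up values

# Comparing a Series with a scalar yields a boolean Series (a mask).
mask = df['population'] > 1000

# Indexing with the mask keeps only the rows where it is True.
large = df[mask]

# The apply-based filter from the cell above selects the same rows.
same = df[df['population'].apply(lambda val: val > 1000)]

print(large)
```

`apply` remains useful when the per-value logic is too complex to express as a direct comparison.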

## Copying a DataFrame
* **Copying.** If you call the `pd.DataFrame.copy` method, you create a true independent copy. Changes to the original DataFrame or to the copy will not be reflected in the other, so you can modify the copy without affecting the original DataFrame.


In [None]:
# Create a true copy of california_housing_dataframe
print("Experiment with a true copy:")
copy_of_california_housing = california_housing_dataframe.copy()
# Drop two columns from california_housing_dataframe; the copy keeps all of its columns
california_housing_dataframe = california_housing_dataframe.drop(['longitude', 'latitude'], axis=1)
print(copy_of_california_housing)
print(california_housing_dataframe)
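The independence of a copy is clearest when a cell is changed in place. The following sketch, on a made-up three-row frame, contrasts a second name for the same `DataFrame` with a copy made by `copy()`; the variable names are illustrative.

```python
import pandas as pd

original = pd.DataFrame({'population': [1015, 1129, 333]})  # made-up values

alias = original               # a second name for the same object
independent = original.copy()  # a separate object with its own data

# Change one cell of the original in place.
original.loc[0, 'population'] = 0

print(alias.loc[0, 'population'])        # follows the original
print(independent.loc[0, 'population'])  # keeps the old value
```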

# Intro to visualization
* An introduction to the Python packages used for visualization
* Different plot styles


## Matplotlib and Seaborn
* `Matplotlib` is a visualization library in Python for 2D plots of arrays. `Matplotlib` is written in Python and makes use of the NumPy library. It is well suited to creating basic graphs such as line charts, bar charts, and histograms.
* `Seaborn` is a dataset-oriented library, built on top of `Matplotlib`, for making statistical graphics in Python.

## Import the modules

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

## Histogram

To create a histogram plot in Matplotlib, we can use the `hist` method. We will also create a figure and an axis using `plt.subplots` to give our plot a title and labels.

In [None]:
# create figure and axis
fig, ax = plt.subplots()
# plot histogram
ax.hist(california_housing_dataframe['housing_median_age'])
# set title and labels
ax.set_title('Housing median age Distribution')
ax.set_xlabel('Housing median age')
ax.set_ylabel('Frequency')
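A histogram splits the value range into equal-width bins and counts how many values fall into each one. The counting step can be sketched with `np.histogram` on a handful of made-up ages; the bin count of 5 is an arbitrary choice for illustration (`ax.hist` also accepts a `bins` argument).

```python
import numpy as np

# Made-up ages, for illustration only.
ages = np.array([2, 5, 15, 17, 19, 20, 33, 35, 52, 52])

# Split the range [min, max] into 5 equal-width bins and count values per bin.
counts, bin_edges = np.histogram(ages, bins=5)

print(counts)     # one count per bin
print(bin_edges)  # 6 edges delimit 5 bins
```

Fewer bins give a smoother but coarser picture; more bins show finer detail at the cost of noisier counts.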

## Bar Plot

A bar chart can be created using the `bar` method. The `bar` method doesn’t calculate the frequency of each category automatically, so we will use the pandas `value_counts` method to do this. A bar chart is useful for categorical data that doesn’t have many different categories (fewer than 30); with more, the chart can get quite messy.



In [None]:
# create a figure and axis
fig, ax = plt.subplots()
# count the occurrence of each class
data = california_housing_dataframe['housing_median_age'].value_counts()
# get x and y data
age = data.index
frequency = data.values
# create bar chart
ax.bar(age, frequency)
# set title and labels
ax.set_title('Housing median age Distribution')
ax.set_xlabel('Housing median age')
ax.set_ylabel('Frequency')
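To see what `value_counts` hands to the bar chart, the sketch below runs it on a short made-up `Series`. The result is itself a `Series` whose index holds the distinct values and whose values hold the counts.

```python
import pandas as pd

# Made-up ages, for illustration only.
ages = pd.Series([52, 17, 52, 35, 17, 52], name='housing_median_age')

# Distinct values become the index; counts become the values,
# ordered from most to least frequent.
counts = ages.value_counts()

# Sorting by index orders the categories by age instead of by frequency.
counts_by_age = counts.sort_index()

print(counts)
print(counts_by_age)
```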

In [None]:
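# load the Iris dataset, used in the plots below; the file has no header row, so the column names are passed explicitly
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])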

## Scatter plot

To create a scatter plot in Matplotlib, we can use the `scatter` method. We will also create a figure and an axis using `plt.subplots` to give our plot a title and labels.

A scatter plot shows the relationship between x and y, which makes any correlation between the two visible.

In [None]:
# create a figure and axis
fig, ax = plt.subplots()

# scatter the sepal_length against the sepal_width (x : sepal_length; y : sepal_width)
ax.scatter(iris['sepal_length'], iris['sepal_width'])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')

We can give the graph more meaning by coloring each data point by its class. This can be done by creating a `dictionary` that maps from class to color and then scattering each point on its own using a for-loop and passing the respective color.

In [None]:
# create color dictionary
colors = {'Iris-setosa':'r', 'Iris-versicolor':'g', 'Iris-virginica':'y'}
# create a figure and axis
fig, ax = plt.subplots()
# plot each data-point
for i in range(len(iris['sepal_length'])):
    ax.scatter(iris['sepal_length'][i], iris['sepal_width'][i],color=colors[iris['class'][i]])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
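The loop above issues one `scatter` call per point. As an alternative sketch, the class labels can be translated to colors in one step with `Series.map` and passed to a single `scatter` call through the `c` argument. The four-row `sample` frame below is a hand-typed stand-in for the iris data, not the full dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Small hand-typed sample with the same column names as the iris frame.
sample = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.3, 4.9],
    'sepal_width': [3.5, 3.2, 3.3, 3.0],
    'class': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa'],
})
colors = {'Iris-setosa': 'r', 'Iris-versicolor': 'g', 'Iris-virginica': 'y'}

# Translate each class label into its color in one step.
point_colors = sample['class'].map(colors)

fig, ax = plt.subplots()
# One scatter call draws all points; c takes one color per point.
ax.scatter(sample['sepal_length'], sample['sepal_width'], c=point_colors.tolist())
ax.set_title('Iris sample')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
```

On larger datasets a single call is usually noticeably faster than one call per point.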

## Line Plot

In the line plot below:

* x is the index of each record in the dataset.
* y is the value of each record in each column, with each column drawn in a different color. For example, the `sepal_length` values of all the points are shown in blue.




In [None]:
# get columns to plot
columns = iris.columns.drop(['class'])
# create x data
x_data = range(0, iris.shape[0])
# create figure and axis
fig, ax = plt.subplots()
# plot each column
for column in columns:
    ax.plot(x_data, iris[column], label=column)
# set title and legend
ax.set_title('Iris Dataset')
ax.set_xlabel('data point')
ax.legend()

## Heatmap
A heatmap is a graphical representation of data in which the individual values of a matrix are represented as colors. Heatmaps are well suited to exploring the correlation between features in a dataset.

Seaborn makes it much easier to create a heatmap and add annotations:

In [None]:
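# drop the non-numeric 'class' column before computing correlations; newer pandas versions raise an error otherwise
sns.heatmap(iris.drop(columns=['class']).corr(), annot=True)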
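The matrix that the heatmap colors comes from `DataFrame.corr`, which computes the pairwise Pearson correlation between numeric columns by default. A minimal sketch on made-up data, where the column names `a`, `b`, and `c` are illustrative:

```python
import pandas as pd

# Made-up measurements: 'b' rises with 'a', while 'c' falls as 'a' rises.
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],
    'c': [4.0, 3.0, 2.0, 1.0],
})

# Square matrix with one row and one column per numeric column.
corr = df.corr()
print(corr)
```

Values near 1 indicate that two columns rise together, values near -1 that one falls as the other rises, and values near 0 that there is little linear relationship.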