# __Introduction to Python for Data Science__
## _CSE Mentor Program - University of Colorado, Denver. Spring-2019_

This workshop introduces Python to undergraduate and graduate students in the context of data science techniques.

Over three sessions we will cover the basics of the Python language, the use of Pandas to access and manipulate data, and the Scikit-Learn library for some basic analysis.

# Session 3 - Introduction to Pandas
In this session we will focus on Pandas, a library for managing tabular (relational-style) data that is useful for manipulating small, medium-sized, and large datasets.

<hr/>


In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.style as mplstyle


# Pandas

## 1. Series

A pandas *Series* object is like an array or list.

**A Series is a one-dimensional labeled _(indexed)_ array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).** 
The axis labels are collectively referred to as the index.

In [None]:
print("A list:")
[1,3,5,np.nan,6,8]

In [None]:
s = pd.Series([1,3,5,np.nan,6,8])
print("A Series:")
s
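
The index does not have to be the default `0..n-1` integers. As a minimal sketch, the following builds a Series with string labels (the labels `"a"` to `"d"` and the variable names are made up for this illustration) and reads values both by label and by position:

```python
import numpy as np
import pandas as pd

# A Series with explicit string labels instead of the default 0..n-1 index.
# The labels "a".."d" are invented for this illustration.
labeled = pd.Series([10, 20, np.nan, 40], index=["a", "b", "c", "d"])

by_label = labeled["b"]           # look up a value by its label
by_position = labeled.iloc[0]     # look up a value by its integer position
n_missing = labeled.isna().sum()  # count the NaN entries

print(labeled)
print(by_label, by_position, n_missing)
```

Because the data contains `np.nan`, pandas stores the whole Series as floating point.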

#### We can create a Series with all elements initialized to the same value by passing a scalar together with an index

In [None]:
pd.Series(2, index=list(range(4)), dtype='float32')

## 2. Categorical Set
Represents a categorical variable.
Categoricals can take on only a limited, and usually fixed, number of possible values (categories).
In contrast to statistical categorical variables, a Categorical might have an order, but numerical operations (additions, divisions, …) are not possible.

In [None]:
pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])

In [None]:
c = pd.Categorical(['a','b','c','a','b','c'], ordered=True,   categories=['z','c', 'b', 'a','w'])
print("Category C:")
print(c)

print("")
print("In the categorical list C, the Min value:",c.min()," Max value:",c.max() )

print("")
print("If there is a value that does not match a valid category, it will be set as NaN.")
print(pd.Categorical(['a','b','c','a','b','c','d','z'], ordered=True,   categories=['c', 'b', 'a']))
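
Internally, a Categorical stores each element as an integer code that points into its list of categories. The sketch below, which uses made-up size labels, shows the `categories` and `codes` attributes and what happens to a value outside the allowed categories:

```python
import pandas as pd

# Ordered categorical; the order is given by the `categories` list.
sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)

print(sizes.categories)  # the allowed values, in order
print(sizes.codes)       # integer position of each element in `categories`

# A value outside the categories becomes NaN and gets the code -1.
with_unknown = pd.Categorical(["small", "huge"],
                              categories=["small", "medium", "large"])
print(with_unknown.codes)
```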

## 3. DataFrame
A **DataFrame** is a two-dimensional table whose columns are **Series** objects that share one index.

In [None]:
df2 = pd.DataFrame({  'A' : 1.,
                      'B' : pd.Timestamp('20130102'),
                      'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                      'D' : np.array([3] * 4,dtype='int32'),                     # repeats the element 3 four times
                      'E' : pd.Categorical(["test","train","test","train"]),
                      'F' : 'foo' })

display(df2)

#### Each column in a *DataFrame* can have a separate data type.

In [None]:
print("Dataset Columns Series types")
print(df2.dtypes)
print("")
print("")
print("Each axis has a type and a set of indexes")
print(df2.axes)

### 3.1. Creating Dataframes

#### Reading From File
One way to create a *DataFrame* is to read in a data file, like a CSV (Comma Separated Values) file.

The `header=None` keyword argument tells Pandas that the CSV file does not contain a header row (a row with column names).

In [None]:
csv_df = pd.read_csv("iris.csv",header=None)

`iris.csv` is a well-known dataset for data analysis and machine learning. https://archive.ics.uci.edu/ml/datasets/iris

The dataset describes several characteristics of iris flowers and the class (species) of each sample.

<img width=30% src="./files/flower-labelled_med.jpeg">

1. sepal length in cm 
2. sepal width in cm 
3. petal length in cm 
4. petal width in cm 
5. class: 
  - Iris Setosa 
  - Iris Versicolour 
  - Iris Virginica

The content looks as follows:
<pre>
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
</pre>

#### Calling `<df_name>.head(n)` returns the first `n` rows of the DataFrame

In [None]:
csv_df.head(10)

#### Naming Columns 
Our columns have no names because the CSV file did not contain a header row. 

We can add column names by setting the `.columns` attribute of the *DataFrame* with a list containing the desired column names.

In [None]:
csv_df.columns = ["sepal_length","sepal_width","petal_length","petal_width","class"]

In [None]:
csv_df.head(10)
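
The column names can also be supplied while reading, through the `names=` argument of `read_csv`. The sketch below reads two rows of the sample from an in-memory buffer rather than from `iris.csv`, so it runs without the file; the names `sample` and `named_df` are illustrative only:

```python
import io
import pandas as pd

# Two rows copied from the sample above, held in memory instead of a file.
sample = io.StringIO(
    "4.5,2.3,1.3,0.3,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# `names=` supplies the column labels; `header=None` says the data has no header row.
named_df = pd.read_csv(sample, header=None, names=column_names)
print(named_df)
```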

#### Creating from List Or Dictionaries
We can also create a *DataFrame* directly in Python using Python's *list* or *dict* types.

***From List***

In [None]:
data_list = [[0, "string_0", 3.4],[1,"string_1", 3.5],[2,"string_2", 3.6]]
data_list

We can also set the column names used in the *DataFrame* when we instantiate it, using the `columns=` keyword parameter.

In [None]:
list_df = pd.DataFrame(data_list, columns=["myint","mystring","myfloat"])
list_df

***From Dictionary***

When creating a *DataFrame* from a list of dicts, each dict becomes a row and its keys become the columns of the DataFrame.

In [None]:
rows = []
for i in range(10):
    rows.append({"id":i, "name":"Name of object "+str(i), "index_squared":i**2})
# rows.append({"id":10, "name":"Name of object 10", "index_squared":100, "other_data":"Only available for index 10"})
rows

In [None]:
dict_df = pd.DataFrame(rows)
dict_df
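
A dictionary can also describe the data column by column, with each key mapped to a list of values. The following sketch contrasts that with the row-oriented form, and shows that a key missing from one row is filled with `NaN`. The variable names and the `extra` column are made up for this illustration:

```python
import pandas as pd

# Column-oriented input: each key is a column name, each value is that column's data.
columns = {
    "id": [0, 1, 2],
    "name": ["Name of object 0", "Name of object 1", "Name of object 2"],
    "index_squared": [0, 1, 4],
}
col_df = pd.DataFrame(columns)

# Row-oriented input where the second row lacks the "extra" key:
# the gap is filled with NaN.
row_df = pd.DataFrame([{"id": 0, "extra": "x"}, {"id": 1}])

print(col_df)
print(row_df)
```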


### 3.2 DataFrame Viewing and Selection

Evaluating a *DataFrame* object in a Jupyter notebook will print out a nice, formatted view of the *DataFrame*.
If the *DataFrame* is too large to print entirely, only the first and last few rows will be visible.

In [None]:
dict_df

We can select individual columns from the *DataFrame* for viewing or other operations using the array access operator `[]`.

Selecting a column will return the *Series* object for that column.

In [None]:
dict_df["id"]

Multiple columns can be selected by passing a *list* inside the array operator.

When selecting multiple columns, the order of column names in the list determines the order of the columns in the resulting *DataFrame*.

In [None]:
dict_df[["name","id"]]

We can also select rows of the *DataFrame* using the `.iloc[]` indexer with integer positions or slices.

In [None]:
dict_df.iloc[1:3]

__.iloc__ can also be used to slice the *DataFrame* columns by integer position (not column name).

In [None]:
dict_df.iloc[1:3,0:2]

__.loc__ can be used to retrieve rows and columns from the *DataFrame* by label. Unlike `.iloc`, a `.loc` slice includes its end label.

In [None]:
display(dict_df.loc[0:3,["id","name"]])

display(dict_df.loc[:,["id","name"]])

display(dict_df.loc[0:2,:])
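
The same slice `1:3` selects a different number of rows with each indexer. A small sketch on a throwaway frame (the name `demo` and its columns are made up for this illustration):

```python
import pandas as pd

demo = pd.DataFrame({"id": range(5), "sq": [i**2 for i in range(5)]})

by_position = demo.iloc[1:3]  # positions 1 and 2; the stop position 3 is excluded
by_label = demo.loc[1:3]      # labels 1, 2 and 3; the stop label 3 is included

print(len(by_position), len(by_label))
```

With the default integer index, positions and labels coincide, which makes this difference easy to overlook.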

We can also select using *boolean indexing*, which returns only the rows that meet a condition.

In this case, we only want to see rows where the value of column `index_squared` is >= 2.

In [None]:
dict_df["index_squared"] >= 2

In [None]:
dict_df[dict_df["index_squared"] >= 2]
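
Several conditions can be combined with `&` (and), `|` (or) and `~` (not). The sketch below rebuilds a small frame shaped like `dict_df` so that it runs on its own; the names `demo`, `in_range`, `extremes` and `not_small` are illustrative:

```python
import pandas as pd

demo = pd.DataFrame({"id": range(10),
                     "index_squared": [i**2 for i in range(10)]})

# Each comparison needs its own parentheses because & and | bind
# more tightly than >= and <.
in_range = demo[(demo["index_squared"] >= 2) & (demo["index_squared"] < 50)]
extremes = demo[(demo["id"] == 0) | (demo["id"] == 9)]
not_small = demo[~(demo["index_squared"] < 2)]

print(in_range)
```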

#### Sorting dataframes
We can sort the *DataFrame* on one or more columns using the `sort_values()` method of *DataFrame*.

In [None]:
dict_df

Here we sort the rows of the *DataFrame* by the values in column `index_squared`. 

The `ascending=False` parameter specifies that rows should be sorted in *descending* order.

In [None]:
display(dict_df.sort_values("index_squared",ascending=False))
display(dict_df)

Sorting is not done in place unless we pass the `inplace=True` argument.

In [None]:
display(dict_df.sort_values("index_squared",ascending=False, inplace=True))
display("No dataframe is returned in the previous case")
display(dict_df)

We can also perform a sorting operation using the Index of the *DataFrame*.

In [None]:
dict_df.sort_index(ascending=False)
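
To sort on more than one column, `sort_values()` accepts a list of column names and a matching list of directions. A minimal sketch with made-up `group` and `score` columns:

```python
import pandas as pd

demo = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "score": [1, 3, 2, 4],
})

# Sort by group ascending, then by score descending within each group.
ordered = demo.sort_values(["group", "score"], ascending=[True, False])

# reset_index(drop=True) renumbers the rows 0..n-1 after sorting.
renumbered = ordered.reset_index(drop=True)

print(ordered)
print(renumbered)
```

The sorted frame keeps the original index labels, which is why `sort_index()` can restore the initial order.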

### 3.3. Adding New Data

In [None]:
dates = pd.date_range('20130101', periods=6)
display(dates)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
display(df)

In [None]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1

We want to add a new column containing the values in `s1` to our *DataFrame* `df`.

In [None]:
df["F"] = s1
df
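
Note that the assignment matches values by index label, not by position. The sketch below repeats the situation with a one-column frame of zeros (the names `frame` and `shifted` are made up): the Series starts one day later, so the first date gets `NaN` and the Series value for the extra date is dropped.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)              # Jan 1 .. Jan 6
frame = pd.DataFrame(np.zeros((6, 1)), index=dates, columns=["A"])
shifted = pd.Series([1, 2, 3, 4, 5, 6],
                    index=pd.date_range("20130102", periods=6))  # Jan 2 .. Jan 7

frame["F"] = shifted  # values are matched by date, not by position

print(frame)
```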

Set values by label, position, or by using a NumPy array.

In [None]:
# Set by label
df.at[dates[0],'A'] = 0

# Set by position
df.iat[0,1] = 100

# Set by NumPy array
display(np.array([5] * len(df)))
df.loc[:,'D'] = np.array([5] * len(df))

In [None]:
df

### 3.4. Merging and Joining with DataFrames

We can concatenate *DataFrame*s together into a larger *DataFrame*.

In [None]:
df = pd.DataFrame(np.random.randn(10, 4))
df

In [None]:
pieces = [ df[:3], df[3:7], df[7:] ]
display(pieces)
type(pieces)

In [None]:
pd.concat(pieces)
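
The pieces above are slices of one frame, so their index labels never collide. When frames come from different sources the labels can repeat; as a sketch with two made-up frames, `ignore_index=True` renumbers the result:

```python
import pandas as pd

top = pd.DataFrame({"x": [1, 2]})
bottom = pd.DataFrame({"x": [3, 4]})

stacked = pd.concat([top, bottom])                        # index labels 0, 1, 0, 1
renumbered = pd.concat([top, bottom], ignore_index=True)  # index labels 0, 1, 2, 3

print(stacked)
print(renumbered)
```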

#### We can also join *DataFrame* objects together using SQL-style join operations.

In [None]:
left  = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

display(left)
display(right)

The `on=` keyword argument specifies a common column name in both *DataFrame*s that will be used to join them together.

In [None]:
pd.merge(left,right,on="key")

Another example of a *DataFrame* join.

In [None]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

display(left)
display(right)

pd.merge(left, right, on='key')
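
By default `pd.merge` performs an inner join, so the `bar` row of `right` has no match and is left out. The `how=` argument selects other SQL-style joins; the sketch below compares the default with an outer join on the same two frames (the names `inner` and `outer` are illustrative):

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

inner = pd.merge(left, right, on="key")               # default how="inner"
outer = pd.merge(left, right, on="key", how="outer")  # keep unmatched keys from both sides

print(inner)
print(outer)
```

In the outer join, the `bar` row has no counterpart in `left`, so its `lval` is `NaN`.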


#### Join using attributes with different names (theta join $\bowtie$ )

In [None]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [1, 1]})

display(left)
display(right)


pd.merge(left, right, left_on='lval',right_on='rval')

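#### We can append rows to a *DataFrame* using `pd.concat()`.

Older pandas versions provided a `.append()` member function of *DataFrame* for this; it was deprecated in pandas 1.4 and removed in pandas 2.0.

In [None]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
display(df)
row3 = df.iloc[3]
display(row3)

Here we append the row at index 3 (the fourth row) of the *DataFrame* to the end of the *DataFrame*.

In [None]:
print("Now the new row 8 has the same values as row 3.")
# row3 is a Series; to_frame().T turns it back into a one-row DataFrame
pd.concat([df, row3.to_frame().T], ignore_index=True)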


## 4. Writing Data to File

You can write a *DataFrame* out to file using the `.to_csv()` function of *DataFrame*. 

In [None]:
df = pd.DataFrame(np.random.randn(25,5), columns=["A","B","C","D","E"])
df[:10]

In [None]:
df.info()

#### Writing to CSV
When writing to a CSV, specify the filename as a string. You can also specify whether to write a header row (column name row) using the `header` keyword argument, and whether to write the index column out to the file using the `index` keyword argument.

In this call to `to_csv()` we specify `index=False` to skip writing the index column out to a file.

In [None]:
df.to_csv("my_data.csv",index=False)

#### Writing to Pickle
We can also save the *DataFrame* as a serialized Python object in a pickle file.

In [None]:
df.to_pickle("my_data.pkl")
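
Both files can be read back with `pd.read_csv` and `pd.read_pickle`. The sketch below does a full round trip in a temporary directory so that it leaves no files behind; the variable names are made up for this illustration:

```python
import os
import tempfile

import numpy as np
import pandas as pd

original = pd.DataFrame(np.random.randn(5, 2), columns=["A", "B"])

# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "my_data.csv")
    pkl_path = os.path.join(tmp, "my_data.pkl")

    original.to_csv(csv_path, index=False)
    original.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

# CSV stores the numbers as text, so compare with a tolerance;
# pickle restores the same values and dtypes.
print(np.allclose(original.values, from_csv.values))
print(from_pkl.equals(original))
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.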

## 5. Matplotlib with Pandas

Using Matplotlib, we can plot data in a *Series* or a *DataFrame* directly.

In [None]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2016', periods=1000))
display(ts[:5])

In [None]:
# cumulative sum of the series
ts = ts.cumsum()
display(ts[:5])

#### Plotting a Series




In [None]:
plt.figure(figsize=(9,7))
plt.title("Random Data")
ts.plot()

In [None]:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df[:5]

In [None]:
df = df.cumsum()
df[:5]

In [None]:
df.plot(figsize=(9,7))
plt.title("More Random Data")
plt.legend(loc="best")

#### We can create a histogram of data in a DataFrame using the `hist()` function of *DataFrame*.

In [None]:
df = pd.DataFrame(
    {'length': [1.5, 0.5, 1.2, 0.9, 3],
    'width': [0.7, 0.2, 0.15, 0.2, 1.1]}, 
    index= ['pig', 'rabbit', 'duck', 'chicken', 'horse'])
df

In [None]:
hist = df.hist(bins=3)

- Plotting the distribution of the iris dataset

In [None]:
csv_df[10:51:10]

In [None]:
csv_df.hist(figsize = (40,20))

- Plotting the distribution of the length of the petals from the iris dataset

In [None]:
csv_df[["petal_length"]].hist()

In [None]:
for column in csv_df.columns:
    csv_df[column].hist( figsize=(20,10),label=column.title())
    plt.legend()


In [None]:
for column in csv_df.columns:
    csv_df[column].hist(figsize=(10,5),label=column.title())
    plt.show()