# Chapter 1 - Introduction

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns #Samuel Norman Seaborn
from sklearn.preprocessing import scale
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.decomposition import PCA

%matplotlib inline
plt.style.use('seaborn-white')  # renamed to 'seaborn-v0_8-white' in matplotlib >= 3.6
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})

`Python` is a general-purpose language with statistics modules. When it comes to building complex analysis pipelines that mix statistics with e.g. image analysis, text mining, or control of a physical experiment, the richness of `Python` is an invaluable asset.

## Pandas

### Constructing data

One way to think of a `Series` is as a labeled array.
You can create a `Series` by passing a list of values and letting pandas create a default integer *index*:

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

Since we did not specify an index for the data, a default one consisting of the integers `0` through `N - 1` (where `N` is the length of the data) is created. Often you'll want to create a Series with an index identifying each data point with a label:

In [None]:
s2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
s2

Another way to think about a `Series` is as a fixed-length, ordered dictionary, as it is a mapping of index values to data values.

In [None]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
s3 = pd.Series(sdata)
s3
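
The dictionary analogy can be made concrete with a short sketch. It reuses the `sdata` dictionary from the cell above; the names `states` and `s4` are introduced here only for illustration.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
s3 = pd.Series(sdata)

# Membership tests look at the index, just like dictionary keys
has_texas = "Texas" in s3
has_california = "California" in s3

# Label-based lookup works like a dictionary lookup
texas_value = s3["Texas"]

# Passing an explicit index selects and orders the keys;
# labels that are missing from the dictionary get NaN
states = ["California", "Ohio", "Oregon", "Texas"]
s4 = pd.Series(sdata, index=states)

print(has_texas, has_california, texas_value)
print(s4)
```

Unlike a plain dictionary, the resulting `Series` keeps the order given by `states` and supports vectorized arithmetic on its values.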

Tabular data is stored and manipulated in a `pandas.DataFrame`, the `Python` equivalent of a spreadsheet table. It differs from a 2D `numpy` array in that it has named columns, can hold a different data type in each column, and offers elaborate mechanisms for selecting and reshaping data. A `DataFrame` has both a row index and a column index.

In [None]:
dates = pd.date_range("20220101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

Creating from arrays: a `pandas.DataFrame` can also be seen as a dictionary of 1D `Series` (e.g. built from arrays or lists) that share the same index. If we have 3 `numpy` arrays:

In [None]:
t = np.arange(10) #start from 0
sin_t = np.sin(t)
cos_t = np.cos(t)
df2 = pd.DataFrame({'t': t, 'sin': sin_t, 'cos': cos_t})
df2

One of the most common ways to construct a `DataFrame` is from a dictionary of equal-length lists or `NumPy` arrays:

In [None]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

The columns of the resulting `DataFrame` may have different `dtypes`.

In [None]:
df2.dtypes

### Viewing data

In [None]:
s.array, s.index

In [None]:
df.head()

In [None]:
df.tail(3)

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.values

`describe()` shows a quick statistic summary of your data:

In [None]:
df.describe()

Sorting by an axis:

In [None]:
df.sort_index(axis=1, ascending=False)

Sorting by values:

In [None]:
df.sort_values(by="B")

### Selecting data

Selecting a single column, which yields a `Series`, equivalent to `df.A`:

In [None]:
df["A"]

In [None]:
df[0:3] #Selecting via [], which slices the rows. df["20220101":"20220103"] also works

Selection by labels of columns or rows (loc)

In [None]:
df.loc[:, ["A", "B"]]  

In [None]:
df.loc["20220101":"20220104","A":"C"]

Selection by position (similar to `NumPy` and `Python`)

In [None]:
df.iloc[3]

In [None]:
df.iloc[3:5, 0:2]
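
One difference between the two selectors is easy to miss: label slices with `loc` include both endpoints, while positional slices with `iloc` exclude the stop position, as in plain `Python`. A minimal sketch, using a small hypothetical frame named `demo` with deterministic values:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20220101", periods=6)
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label-based: both "20220104" and column "C" are included
by_label = demo.loc["20220102":"20220104", "A":"C"]

# Position-based: row 4 and column 3 are excluded
by_position = demo.iloc[1:4, 0:3]

print(by_label.shape, by_position.shape)
print(by_label.equals(by_position))
```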

Boolean indexing

In [None]:
df[df["A"] > 0]

Setting and adding data

In [None]:
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2

An important method on pandas objects is `reindex`, which means to create a new object with the values rearranged to align with the new index.

In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj

In [None]:
obj2 = obj.reindex(["a", "b", "c", "d", "e"]) # Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.
obj2  #pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. Check dropna(), fillna() and isna()
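
The missing-data helpers mentioned in the comment above can be tried on the reindexed `Series`. This is a small sketch; the variable names other than `obj` and `obj2` are illustrative.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj2 = obj.reindex(["a", "b", "c", "d", "e"])  # "e" has no value, so it becomes NaN

missing_mask = obj2.isna()   # boolean Series, True only for "e"
dropped = obj2.dropna()      # new Series without the NaN entry
filled = obj2.fillna(0.0)    # new Series with NaN replaced by 0.0
total = obj2.sum()           # NaN is skipped by default

print(missing_mask)
print(dropped.index.tolist(), filled["e"], total)
```

None of these calls modifies `obj2` itself; each returns a new object.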

### Computation about data

In [None]:
df.mean()

In [None]:
df.apply(np.cumsum)

### Merge and Group data

In [None]:
pieces = [df[:2], df[2:4], df[4:]]
pieces, type(pieces[0])

In [None]:
pd.concat(pieces) #Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy, and may be expensive.

In [None]:
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
df

In [None]:
df.groupby("A").sum()
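
Grouping is not limited to a single key or a single statistic. The sketch below uses a small hypothetical frame named `demo` (with fixed values so the results are easy to check by hand) to group by two columns and compute several aggregates at once.

```python
import pandas as pd

demo = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "one"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# One key, one statistic: foo -> 1.0 + 3.0, bar -> 2.0 + 4.0
by_a = demo.groupby("A")["C"].sum()

# Two keys, several statistics: the result has a hierarchical (A, B) index
by_ab = demo.groupby(["A", "B"])["C"].agg(["mean", "count"])

print(by_a)
print(by_ab)
```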

### I/O

In [None]:
df.to_excel("foo.xlsx", sheet_name="Sheet1")

In [None]:
df.to_csv("foo.csv")
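
Reading the file back is the mirror operation. Because `to_csv` writes the index as the first (unnamed) column by default, passing `index_col=0` to `pd.read_csv` restores it. The sketch below writes to a temporary directory so it leaves no files behind; `demo` and `restored` are illustrative names.

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({"A": ["foo", "bar"], "C": [0.5, -1.25]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.csv")
    demo.to_csv(path)                           # index goes into the first column
    restored = pd.read_csv(path, index_col=0)   # use that column as the index again

print(restored)
```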

For more information, refer to https://wesmckinney.com/book/pandas-basics.html and https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

## Seaborn

Statistical analysis is a process of understanding how variables in a dataset relate to each other and how those relationships depend on other variables. Visualization can be a core component of this process because, when data are visualized properly, the human visual system can see trends and patterns that indicate a relationship.

The examples below use the "tips" dataset: https://www.kaggle.com/ranjeetjain3/seaborn-tips-dataset

In [None]:
tips = sns.load_dataset("tips")
print(tips.shape)
tips.head()

### Scatterplot

The scatter plot is a mainstay of statistical visualization. It depicts the **joint distribution of two variables** using a cloud of points, where each point represents an observation in the dataset. 

In [None]:
# Scatterplot
sns.relplot(x="total_bill", y="tip", data=tips) 

While the points are plotted in two dimensions, another dimension can be added to the plot by conditioning on a third variable. In `seaborn`, this is referred to as using a “hue/style/size semantic”, because the color/style/size of the point gains meaning:

In [None]:
sns.relplot(x="total_bill", y="tip", hue="smoker", data=tips) 

In [None]:
sns.relplot(x="total_bill", y="tip", style="smoker", data=tips) # you can use a different marker style for each class

In [None]:
sns.relplot(x="total_bill", y="tip", hue="size", data=tips) #if hue is numeric rather than categorical

In [None]:
sns.relplot(x="total_bill", y="tip", size="size", sizes=(15, 200), data=tips) #size rather than colors

Note that we can plot small multiples by using the `row` and `col` variables:

In [None]:
sns.relplot(x="total_bill", y="tip", hue="smoker",
            col="time", data=tips) #show in different subplot

### Lineplot

With some datasets, you may want to understand changes in one variable as a function of time, or a similarly continuous variable. In this situation, a good choice is to draw a line plot to **emphasize the continuity**.

In [None]:
df = pd.DataFrame(dict(time=np.arange(500),
                       value=np.random.randn(500).cumsum()))
g = sns.relplot(x="time", y="value", kind="line", data=df)

More complex datasets will have multiple measurements for the same value of the x variable. The default behavior in seaborn is to aggregate the multiple measurements at each x value by plotting the mean and a 95% confidence interval around the mean, computed by bootstrapping:

In [None]:
fmri = sns.load_dataset("fmri")
sns.lineplot(x="timepoint", y="signal", hue="event", style="event", markers=True, data=fmri)
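
To see what a bootstrapped confidence interval involves, here is a rough sketch of the percentile bootstrap for the mean at a single x value. It runs on synthetic data and is not seaborn's internal code; `measurements`, `n_boot` and the other names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the repeated measurements at one x value
measurements = rng.normal(loc=1.0, scale=0.5, size=40)

n_boot = 1000
# Resample with replacement and record the mean of each resample
boot_means = np.array([
    rng.choice(measurements, size=measurements.size, replace=True).mean()
    for _ in range(n_boot)
])

# The 2.5th and 97.5th percentiles bound a 95% interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
sample_mean = measurements.mean()

print(f"mean={sample_mean:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

Repeating this at every x value gives the shaded band drawn around the line.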

### Histplot

An early step in any effort to analyze or model data should be to understand **how the variables are distributed**. Techniques for distribution visualization can provide quick answers to many important questions. What range do the observations cover? What is their central tendency? Are they heavily skewed in one direction? Is there evidence for bimodality? Are there significant outliers? Do the answers to these questions vary across subsets defined by other variables?

In [None]:
sns.displot(x="total_bill", data = tips) #check parameter bins and binwidth

In [None]:
sns.histplot(x="day", hue="sex", data=tips) #By default, the different histograms are “layered” on top of each other and, in some cases, they may be difficult to distinguish.

In [None]:
sns.histplot(x="day", hue="sex", multiple="dodge", shrink=.8, data=tips)

### Kdeplot

Kernel density estimation (KDE) presents a different solution to the same problem. Rather than using discrete bins, a KDE plot smooths the observations with a Gaussian kernel, producing a continuous density estimate:

In [None]:
sns.kdeplot(x="total_bill", data=tips)
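
The idea behind a Gaussian KDE fits in a few lines of `NumPy`: place one Gaussian bump on every observation and average them. The sketch below uses synthetic data and a hand-picked bandwidth, whereas `kdeplot()` chooses a bandwidth automatically; the function name `gaussian_kde_1d` is hypothetical.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian kernels centered on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] is the scaled distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (samples.size * bandwidth)

rng = np.random.default_rng(0)
samples = rng.normal(loc=20.0, scale=8.0, size=200)  # synthetic data
grid = np.linspace(-60.0, 100.0, 801)
density = gaussian_kde_1d(samples, grid, bandwidth=3.0)

# A density should integrate to roughly 1 (Riemann-sum approximation)
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 4))
```

A larger bandwidth gives a smoother curve and a smaller one follows the data more closely, which plays the same role as the bin width in a histogram.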

In [None]:
sns.kdeplot(x="total_bill", hue="time", multiple="stack", data=tips)

### Jointplot

`jointplot()` augments a bivariate relational or distribution plot with the marginal distributions of the two variables. By default, `jointplot()` represents the bivariate distribution using `scatterplot()` and the marginal distributions using `histplot()`:

In [None]:
sns.jointplot(x="total_bill", y="tip", data=tips) 

### Pairplot

The `pairplot()` function offers a similar blend of joint and marginal distributions. Rather than focusing on a single relationship, however, `pairplot()` uses a “small-multiple” approach to visualize the univariate distribution of all variables in a dataset along with all of their pairwise relationships:

In [None]:
sns.pairplot(tips)

### Boxplot

As the size of the dataset grows, categorical scatter plots become limited in the information they can provide about the **distribution** of values within each category. When this happens, there are several approaches for summarizing the distributional information in ways that facilitate easy comparisons across the category levels.

In [None]:
sns.catplot(x="day", y="total_bill", kind="box", data=tips)

In [None]:
sns.catplot(x="day", y="total_bill", hue="smoker", kind="box", data=tips) # When adding a hue semantic, the box for each level of the semantic variable is moved along the categorical axis so they don’t overlap

### Barplot

Rather than showing the distribution within each category, you might want to show an estimate of the **central tendency** of the values. In seaborn, the `barplot()` function operates on a full dataset and applies a function to obtain the estimate (taking the mean by default). When there are multiple observations in each category, it also uses bootstrapping to compute a confidence interval around the estimate, which is plotted using error bars:

In [None]:
ax = sns.barplot(x="day", y="tip", data=tips)

In [None]:
sns.catplot(x="day", y="total_bill", hue="sex", kind="bar", data=tips)

### Countplot

In [None]:
sns.countplot(x="smoker", data=tips) #simply count the number

### Pointplot

`pointplot()` also encodes the value of the estimate with height on the other axis, but rather than showing a full bar, it plots the **point estimate** and confidence interval.

In [None]:
sns.pointplot(x="day", y="tip", data=tips, ci=68)  # seaborn >= 0.12 deprecates ci in favor of errorbar=("ci", 68)

### Regplot/lmplot

In the spirit of Tukey, the regression plots in seaborn are primarily intended to add a visual guide that helps to emphasize patterns in a dataset during exploratory data analysis. The goal of `seaborn` is to make exploring a dataset through visualization quick and easy, as doing so is just as (if not more) important than exploring a dataset through tables of statistics.

In [None]:
sns.regplot(x="total_bill", y="tip", data=tips)

When the y variable is binary, simple linear regression also “works” but provides implausible predictions. The solution in this case is to fit a logistic regression, such that the regression line shows the estimated probability of y = 1 for a given value of x:

In [None]:
tips["big_tip"] = (tips.tip / tips.total_bill) > .15
sns.lmplot(x="total_bill", y="big_tip", data=tips,
           logistic=True, y_jitter=.03)
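
The difference between the two fits can be illustrated numerically. The sketch below generates synthetic binary data, fits a straight line with `np.polyfit`, and fits a logistic curve with a few Newton steps written out by hand. It is a simplified illustration, not what `lmplot()` runs internally, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 50, size=200)
# Synthetic binary outcome whose probability of 1 falls as x grows
p_true = 1 / (1 + np.exp(-(2.0 - 0.15 * x)))
y = (rng.uniform(size=x.size) < p_true).astype(float)

x_new = np.array([0.0, 25.0, 100.0])

# Simple linear regression: predictions are not confined to [0, 1]
slope, intercept = np.polyfit(x, y, deg=1)
linear_pred = intercept + slope * x_new

# Logistic regression by Newton's method on the log-likelihood
X = np.column_stack([np.ones_like(x), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p)                        # gradient of the log-likelihood
    hess = X.T @ (X * (p * (1 - p))[:, None])   # negative Hessian
    beta = beta + np.linalg.solve(hess, grad)

logistic_pred = 1 / (1 + np.exp(-(beta[0] + beta[1] * x_new)))

print("linear:  ", linear_pred)
print("logistic:", logistic_pred)
```

The logistic predictions always lie strictly between 0 and 1, so they can be read as probabilities; the linear predictions carry no such guarantee once x moves away from the bulk of the data.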

In [None]:
sns.lmplot(x="total_bill", y="tip", hue="smoker", col="time", data=tips)

### Heatmap

In [None]:
corr = tips.corr(numeric_only=True)  # skip categorical columns (required in pandas >= 2.0)
sns.heatmap(corr)

To customize your plots, see https://seaborn.pydata.org/tutorial/axis_grids.html. For more information, see https://seaborn.pydata.org/tutorial.html

## Lab1: Loading Datasets and processing

<center><img src="https://pandas.pydata.org/docs/_images/02_io_readwrite.svg"></center>

<div align="center"> source: pandas documentation, https://pandas.pydata.org/docs/ </div>

Datasets available on https://www.statlearning.com/resources-second-edition

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
Wage = pd.read_csv('/content/drive/MyDrive/Lab/Data/Wage.csv')
Wage.head(3)

In [None]:
Wage.shape, Wage.columns

In [None]:
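Wage.info()      # prints column dtypes and non-null counts (returns None)
Wage.describe()  # summary statistics for the numeric columns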

In [None]:
Wage[Wage['year'] == 2004]['wage'].mean()

In [None]:
groupby_year = Wage.groupby('year')
for year, value in groupby_year['wage']:
    print((year, value.mean()))

In [None]:
groupby_year # groupby_year is a powerful object that exposes many operations on the resulting group of dataframes:
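
The explicit loop above can usually be replaced by a single aggregation call on the grouped object. A minimal sketch on a small hypothetical frame named `wage_demo`, since the real `Wage` data lives on Google Drive:

```python
import pandas as pd

wage_demo = pd.DataFrame({
    "year": [2003, 2003, 2004, 2004, 2004],
    "wage": [100.0, 120.0, 90.0, 110.0, 160.0],
})

# Same result as looping over the groups and calling .mean() on each
mean_by_year = wage_demo.groupby("year")["wage"].mean()

# Several statistics per group in one call
summary = wage_demo.groupby("year")["wage"].agg(["mean", "median", "count"])

print(mean_by_year)
print(summary)
```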

`Seaborn` combines simple statistical fits with plotting on `Pandas` dataframes.

In [None]:
# creating plots
# Scatter plot with polynomial regression line, the regression line is bounded by the data limits. truncate=True.
plt.figure(figsize=(4,6))
sns.scatterplot(x="age", y="wage", data=Wage, alpha=0.1)
sns.regplot(x="age", y="wage", data=Wage, order=4, truncate=True, scatter=False) 

In [None]:
# creating plots
# Scatter plot with linear regression line (order=1)
plt.figure(figsize=(4,6))
sns.scatterplot(x="year", y="wage", data=Wage, alpha=0.1)
sns.regplot(x="year", y="wage", data=Wage, order=1, truncate=True, scatter=False)

In [None]:
print(Wage.education.unique())
originalL = list(Wage.education.unique())
orderL = [originalL[0], originalL[3], originalL[2], originalL[1], originalL[4]]

In [None]:
plt.figure(figsize=(4,6))
ax = sns.boxplot(x="education", y="wage", data=Wage, order=orderL)
ax.set_xticklabels([t.get_text().split()[0][0]  for t in ax.get_xticklabels()])

In [None]:
Smarket = pd.read_csv('/content/drive/MyDrive/Lab/Data/Smarket.csv')
Smarket.head()

In [None]:
plt.figure(figsize=(4,6))
ax =sns.boxplot(x="Direction", y="Lag1", data=Smarket, order=["Down", "Up"])
ax.set_ylabel("Percentage change in S&P")
ax.set_xlabel("Today's Direction")
plt.title("Yesterday")

In [None]:
plt.figure(figsize=(4,6))
ax = sns.boxplot(x="Direction", y="Lag2", data=Smarket, order=["Down", "Up"])
ax.set_ylabel("Percentage change in S&P")
ax.set_xlabel("Today's Direction")
plt.title("Two Days Previous")

In [None]:
plt.figure(figsize=(4,6))
ax = sns.boxplot(x="Direction", y="Lag3", data=Smarket, order=["Down", "Up"])
ax.set_ylabel("Percentage change in S&P")
ax.set_xlabel("Today's Direction")
plt.title("Three Days Previous")

In [None]:
Smarket = pd.read_csv('/content/drive/MyDrive/Lab/Data/Smarket.csv', index_col=0) #use col0 as index

In [None]:
Smarket.head()

In [None]:
Smarket.loc[:'2004'][['Lag1','Lag2']]

In [None]:
X_train = Smarket.loc[:'2004'][['Lag1','Lag2']]
y_train = Smarket.loc[:'2004']['Direction']

X_test = Smarket.loc['2005':][['Lag1','Lag2']]
y_test = Smarket.loc['2005':]['Direction']

In [None]:
qda = QuadraticDiscriminantAnalysis()
pred = qda.fit(X_train, y_train).predict_proba(X_test)

In [None]:
qda.classes_

In [None]:
plt.figure(figsize=(4,6))
sns.boxplot(x=y_test, y=pred[:,0]) #predicted probability for decrease

In [None]:
NCI60 = pd.read_csv('/content/drive/MyDrive/Lab/Data/NCI60_data.csv')
NCI60.head()

In [None]:
NCI60 = pd.read_csv('/content/drive/MyDrive/Lab/Data/NCI60_data.csv').drop('Unnamed: 0', axis=1)
NCI60.columns = np.arange(NCI60.columns.size)
NCI60.head()

In [None]:
X = pd.DataFrame(scale(NCI60))
X.shape
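
For reference, `scale` standardizes each column to zero mean and unit variance. The equivalent computation in plain `NumPy` is sketched below on synthetic data; the names `raw` and `standardized` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=2.0, size=(64, 3))  # synthetic stand-in for the data

# Subtract each column's mean and divide by its population
# standard deviation (ddof=0, the NumPy default)
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(standardized.mean(axis=0).round(6))
print(standardized.std(axis=0).round(6))
```

Standardizing first matters for PCA, because otherwise variables measured on larger scales dominate the leading components.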

In [None]:
y = pd.read_csv('/content/drive/MyDrive/Lab/Data/NCI60_labs.csv', usecols=[1], skiprows=1, names=['type'])
y.shape

In [None]:
y.type.value_counts()

In [None]:
# Fit the PCA model and transform X to get the principal components
pca2 = PCA()
NCI60_plot = pd.DataFrame(pca2.fit_transform(X))
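
Conceptually, the principal component scores come from a singular value decomposition of the centered data. The sketch below shows that computation on synthetic data with plain `NumPy`; it is an illustration of the idea rather than a copy of scikit-learn's implementation, and the names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(64, 5))            # synthetic stand-in for the scaled data
X_centered = X_demo - X_demo.mean(axis=0)

# Rows of Vt are the principal directions; S holds the singular values
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Scores: coordinates of each observation along the principal directions
scores = U * S

# Share of the total variance captured by each component, in decreasing order
explained_variance_ratio = S**2 / np.sum(S**2)

print(scores.shape)
print(explained_variance_ratio.round(3))
```

The sign of each component is arbitrary, so negating a column of scores before plotting only mirrors the picture and does not change the structure it shows.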

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))

# Left plot
sns.scatterplot(x=0, y=-NCI60_plot[1], data=NCI60_plot, hue=y.type, alpha=0.5, s=50, ax=ax1, legend=False)
ax1.set_xlabel('Z1') 
ax1.set_ylabel('Z2')
   

# Right plot
sns.scatterplot(x=0, y=2, data=NCI60_plot, hue=y.type, alpha=0.5, s=50, ax=ax2)
ax2.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=2)
ax2.set_xlabel('Z1')  
ax2.set_ylabel('Z3')