## Introduction to `pandas`

In [None]:
import pandas as pd

### Read data

`# R code`

`senic <- read.csv('https://www.dropbox.com/s/82qx6810jpbk3lj/SENIC.csv?dl=1')`

In [None]:
# Read data

senic = pd.read_csv('https://www.dropbox.com/s/82qx6810jpbk3lj/SENIC.csv?dl=1')

#### View top rows:

` # R code`

`head(senic)`

In [None]:
# View top five rows

senic.head()

#### Number of rows and columns:

` # R code`

`dim(senic)`

In [None]:
# Get the number of rows and columns

senic.shape

### Slicing a data frame


In [None]:
# Let's create a random sample (with three columns) for this section

senic_sample = senic[['ID', 'Length_stay', 'Age_years']].sample(n=5, random_state=314)
senic_sample

In [None]:
senic_sample.iloc[0]

The `iloc` attribute allows indexing and slicing that references the implicit Python-style index (which always starts at 0).

On the other hand, the `loc` attribute references the explicit index present in the data frame.

In [None]:
senic_sample.loc[6]
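
The difference between the two is easiest to see on a data frame whose index labels are not 0, 1, 2. The following is a minimal sketch on hypothetical toy data (the values and index labels are made up, not taken from the SENIC file):

```python
import pandas as pd

# Hypothetical toy data; the index labels (6, 2, 9) are deliberately not 0, 1, 2
df = pd.DataFrame({'ID': [7, 3, 10], 'Age_years': [55.7, 58.2, 53.7]},
                  index=[6, 2, 9])

first_by_position = df.iloc[0]   # row at position 0, i.e. the first row
row_by_label = df.loc[6]         # row whose index label is 6 (here, the same row)
second_by_position = df.iloc[1]  # row at position 1, whose label is 2

# Slices differ too: iloc excludes the end position, loc includes the end label
by_position = df.iloc[0:2]       # positions 0 and 1
by_label = df.loc[6:2]           # labels 6 through 2, inclusive
```

In this toy frame both slices return the same two rows, but for different reasons: `iloc[0:2]` stops before position 2, while `loc[6:2]` runs up to and including the label 2.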

### Column names

#### View column names:

` # R code`

`names(senic)`

In [None]:
senic.columns

# alternatively, senic.keys()

#### Change column names:

Prepare the list of column names we would like to assign to this data.

` # R code`

`colNames <- c("ID", "length_stay", "age_years", "infection_pct", 
              "culture_ratio", "xray_ratio", "num_beds", "med_sch_aff", 
              "region_num", "num_patients", "num_nurses", "avail_services", 
              "region_name", "medical_school")`

In [None]:
col_names = ["ID", "length_stay", "age_years", "infection_pct", "culture_ratio",
             "xray_ratio", "num_beds", "med_sch_aff", "region_num", "num_patients",
             "num_nurses", "avail_services", "region_name", "medical_school"]

Make sure that the number of column names equals the number of columns.

` # R code`

`length(colNames) == ncol(senic)`

In [None]:
len(col_names) == len(senic.columns)

Assign column names.

` # R code`

`names(senic) <- colNames`

In [None]:
senic.columns = col_names

Change the column name of the first column.

` # R code`

`names(senic)[1] <- "hospitalID"`

In [None]:
senic.rename(columns = {'ID': 'hospital_ID'}, inplace=True)

View the new column names.

In [None]:
senic.keys()
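
One thing to watch with `rename()`: by default, a key that does not match any column is silently ignored. The sketch below uses a hypothetical two-column data frame to show this, along with the `errors='raise'` option that turns a misspelled name into an error.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2], 'length_stay': [7.1, 8.8]})  # hypothetical toy data

renamed = df.rename(columns={'ID': 'hospital_ID'})    # returns a new data frame
unchanged = df.rename(columns={'Id': 'hospital_ID'})  # 'Id' does not exist: nothing is renamed, no error

# Passing errors='raise' turns a misspelled column name into a KeyError
try:
    df.rename(columns={'Id': 'hospital_ID'}, errors='raise')
    typo_detected = False
except KeyError:
    typo_detected = True
```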

### Column types


#### Check the column type of a specific column.

` # R code`

`class(senic$length_stay)`

In [None]:
senic['length_stay'].dtypes

#### Check column types for all columns in the data frame.

` # R code`

`sapply(senic, class)`

In [None]:
senic.dtypes

#### Convert two columns from decimal (float) to integer.

` # R code`

`senic$length_stay <- as.integer(senic$length_stay)`
`senic$age_years <- as.integer(senic$age_years)`

In [None]:
senic['length_stay'] = senic['length_stay'].astype(int)

senic['age_years'] = senic['age_years'].astype(int)

senic.dtypes
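
Like `as.integer()` in R, `astype(int)` drops the fractional part rather than rounding, and it raises an error if the column contains missing values. The sketch below illustrates both points on hypothetical values; the nullable `'Int64'` type is one possible workaround for missing values, not something the SENIC data necessarily needs.

```python
import numpy as np
import pandas as pd

stay = pd.Series([7.9, 8.2, 11.5])  # hypothetical values

truncated = stay.astype(int)        # fractional part is dropped
rounded = stay.round().astype(int)  # round first if rounding is the intent

# A float column with a missing value cannot be cast to a plain integer type
with_gap = pd.Series([7.9, np.nan])
try:
    with_gap.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable integer type keeps the missing value as <NA>
nullable = with_gap.round().astype('Int64')
```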

### Create column

` # R code`

`senic$weeks_stay <- senic$length_stay / 7`

`head(senic[, c('length_stay', 'weeks_stay')])`

In [None]:
# Create a new column

senic['weeks_stay'] = senic['length_stay'] / 7

senic[['weeks_stay', 'length_stay']].head()

### Summarize

#### Summarize data frame

` # R code`

`summary(senic)`

In [None]:
senic.describe()
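
By default, `describe()` summarizes only the numeric columns when a data frame mixes numeric and text columns, whereas `summary()` in R reports on every column. A minimal sketch on hypothetical toy data, using `include='all'` to bring the text columns in:

```python
import pandas as pd

df = pd.DataFrame({'age_years': [55.7, 58.2, 53.7],
                   'region_name': ['NE', 'S', 'S']})  # hypothetical toy data

numeric_summary = df.describe()            # numeric columns only
full_summary = df.describe(include='all')  # adds unique/top/freq rows for text columns
```

For text columns, the `top` row holds the most frequent value and `freq` its count; statistics that do not apply to a column are filled with `NaN`.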

#### Mean of a specific column

` # R code`

`mean(senic$age_years)`

In [None]:
senic.age_years.mean()

#### Mean values for all (numeric) columns

` # R code`

`sapply(senic, mean)`

In [None]:
# numeric_only=True skips text columns such as region_name

senic.mean(numeric_only=True)

#### Categorical variable: frequency count

` # R code`

`summary(senic$region_name)`

In [None]:
senic.region_name.value_counts()

#### Categorical variable: percentage distribution

` # R code`

`summary(senic$region_name) / nrow(senic)`

In [None]:
senic.region_name.value_counts() / len(senic)
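
`value_counts()` can also compute the shares directly with `normalize=True`. The two approaches agree unless the column has missing values, as this sketch on hypothetical values shows:

```python
import pandas as pd

region = pd.Series(['NE', 'S', 'S', 'W', 'S', 'NE', None])  # hypothetical values, one missing

# Fractions of the non-missing rows (missing values are excluded by default)
shares = region.value_counts(normalize=True)

# Fractions of all rows, as in the cell above
shares_of_all = region.value_counts() / len(region)
```

Here `shares` sums to 1, while `shares_of_all` sums to less than 1 because the missing row is counted in the denominator only.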

### Missing values

#### Is hospital_ID null? Create a flag for each row.

` # R code`

`is.na(senic$hospitalID)`

In [None]:
pd.isnull(senic.hospital_ID).head()

In [None]:
# Alternatively,

senic.hospital_ID.isnull().head()

#### Count the total number of rows where hospital_ID is null

` # R code`

`sum(is.na(senic$hospitalID))`

In [None]:
sum(pd.isnull(senic.hospital_ID))

#### Count the number of null values in each column

` # R code`

`apply(is.na(senic), 2, sum)`

In [None]:
senic.isnull().sum()

In [None]:
# Alternatively, use `isin()`, which can be useful if there are special 
#  characters used to denote missing values

senic.isin([999999, 99999]).sum()
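
Once such placeholder values are found, one option is to convert them to proper missing values so that `isnull()` and `dropna()` see them. A minimal sketch, assuming (hypothetically) that 999999 and 99999 are the placeholders; the toy data is made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'num_beds': [279, 99999, 80],
                   'num_nurses': [241, 52, 999999]})  # hypothetical toy data

sentinels = [999999, 99999]              # assumed placeholder values
flagged = df.isin(sentinels).sum()       # placeholders per column
cleaned = df.replace(sentinels, np.nan)  # turn placeholders into NaN
missing = cleaned.isnull().sum()         # now counted as missing
```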

#### Let's create a small data frame with missing values

In [None]:
import numpy as np

temp = pd.DataFrame(data=['c', 'v', np.nan, 'r', None], columns=['Letter'])
temp

In [None]:
# Check the data type 

temp.Letter.dtype

In [None]:
# Use 'isnull()' to check if the column contains a missing value

temp.isnull()

In [None]:
# Select rows from the data frame where the column Letter is not missing

temp[temp.Letter.notnull()]

In [None]:
# Drop rows from the data frame where Letter is missing

temp.dropna()

In [None]:
temp

Note that the `dropna()` function did not drop rows from the original data frame. It just _returned_ the results, which can then be saved into a new (or same) data frame.

In [None]:
# Store the results into a new data frame

temp_clean = temp.dropna()
temp_clean

In [None]:
# Impute missing values

temp.fillna('x')

In [None]:
temp

Again, the `fillna()` function did not modify the original data frame. It just returned the results, which can then be saved into a new (or the same) data frame.

Alternatively, you can use `inplace=True` to replace the original data frame.

In [None]:
temp.fillna('x', inplace=True)
temp

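Note: The pandas developers have discussed deprecating the `inplace=` option, so assigning the result back (for example, `temp = temp.fillna('x')`) is the more future-proof pattern.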
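
A minimal sketch of the reassignment pattern, rebuilding the small data frame from above so that the example is self-contained:

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({'Letter': ['c', 'v', np.nan, 'r', None]})

filled = temp.fillna('x')  # new data frame; temp itself is untouched
n_missing_before = int(temp['Letter'].isnull().sum())

temp = temp.fillna('x')    # reassign to the same name to keep the result
```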

### Pearson correlation coefficient

#### Correlation coefficient between two variables

` # R code`

`cor(senic$num_beds, senic$length_stay)`

In [None]:
senic[['num_beds', 'length_stay']].corr()

` # R code`

`cor.test(senic$num_beds, senic$length_stay)`

<img src='https://www.dropbox.com/s/m7ijuw8e8m6s9lx/corr_sig.JPG?dl=1' width="600" align=left>

In [None]:
# Using scipy: pearsonr() returns the correlation coefficient and its p-value

from scipy.stats import pearsonr

pearsonr(senic['num_beds'], senic['length_stay'])

#### Correlation coefficient between all (numeric) columns

` # R code`

`cor(senic[, sapply(senic, is.numeric)])`

In [None]:
senic.corr(numeric_only=True)
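
To make explicit what these functions compute, the sketch below evaluates the Pearson coefficient by hand, $r = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}$, and compares it with the pandas result. The five rows are hypothetical values, not taken from the SENIC file.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'num_beds': [279, 80, 107, 147, 180],
                   'length_stay': [7.13, 8.82, 8.34, 8.95, 11.20]})  # hypothetical values

# Deviations from the column means
x = df['num_beds'] - df['num_beds'].mean()
y = df['length_stay'] - df['length_stay'].mean()

# Pearson r: sum of cross-products over the square root of the product of sums of squares
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

r_pandas = df['num_beds'].corr(df['length_stay'])  # same quantity for one pair of columns
corr_matrix = df.corr()                            # all pairs at once
```

The off-diagonal entry `corr_matrix.loc['num_beds', 'length_stay']` matches `r_manual` up to floating-point error.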

### Visualize

` # R code`

Load the necessary library:

`library(ggplot2)`

`ggplot(data=senic, mapping = aes(x = num_beds, y = length_stay)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Number of Beds", y = "Length of Stay (days)")`
  
Output:

<img src='https://www.dropbox.com/s/n2o7tgqhoblnejt/ggplot.png?dl=1' width="600" align=left>

#### Using pandas

In [None]:
ax = senic.plot.scatter('num_beds', 'length_stay')

#### Using matplotlib

In [None]:
import matplotlib.pyplot as plt

plt.scatter(senic['num_beds'], senic['length_stay'])
plt.show()

Add axis labels.

In [None]:
plt.scatter(senic['num_beds'], senic['length_stay'])

plt.xlabel('Number of Beds')
plt.ylabel('Length of Stay')
plt.show()

Increase plot size.

In [None]:
plt.figure(figsize=(12, 9))

plt.scatter(senic['num_beds'], senic['length_stay'])

plt.xlabel('Number of Beds', fontsize=14)
plt.ylabel('Length of Stay', fontsize=14)
plt.show()

Modify point aesthetics.

In [None]:
plt.figure(figsize=(12, 9))

plt.scatter(senic['num_beds'], senic['length_stay'], 
            s=100, c='darkorange', alpha=.5)

plt.xlabel('Number of Beds', fontsize=14)
plt.ylabel('Length of Stay', fontsize=14)

plt.show()

#### Using seaborn

In [None]:
import seaborn as sns

sns.set(style='white')

# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x='num_beds', y='length_stay', data=senic, 
                s=100, color='darkorange', alpha=.5)

Increase plot size.

In [None]:
plt.figure(figsize=(12, 9))

sns.set(style='whitegrid')

sns.scatterplot(x='num_beds', y='length_stay', data=senic, 
                s=100, color='darkorange', alpha=.5)

Add axis labels.

In [None]:
sns.set(style='darkgrid')

plt.figure(figsize=(12, 9))

plt.scatter(senic['num_beds'], senic['length_stay'], 
            s=100, c='darkorange', alpha=.5)

plt.xlabel('Number of Beds', fontsize=14)
plt.ylabel('Length of Stay', fontsize=14)

plt.show()

Add a regression line with its confidence interval region.

In [None]:
sns.set(style='darkgrid')

plt.figure(figsize=(12, 9))

sns.regplot(x='num_beds', y='length_stay', data=senic, color='darkorange')

plt.xlabel('Number of Beds', fontsize=14)
plt.ylabel('Length of Stay', fontsize=14)

plt.show()

Joint plot

In [None]:
sns.set(style='darkgrid')

sns.jointplot(x='num_beds', y='length_stay', data=senic, color='darkorange', 
              kind="reg", height=9, ratio=5)

plt.xlabel('Number of Beds', fontsize=14)
plt.ylabel('Length of Stay', fontsize=14)

plt.show()

Smooth line using lowess

In [None]:
sns.set(style='darkgrid')

plt.figure(figsize=(12, 9))

sns.regplot(x='num_beds', y='length_stay', data=senic, 
            color='darkorange', lowess=True)

plt.xlabel('Number of Beds', fontsize=14)
plt.ylabel('Length of Stay', fontsize=14)

plt.show()

If `lowess=True`, `regplot()` uses `statsmodels` to estimate a nonparametric _lowess_ model (locally weighted linear regression). Note that confidence intervals cannot currently be drawn for this kind of model.

On the other hand, `geom_smooth()` in R's ggplot2 defaults to _loess_ (locally estimated scatterplot smoothing) for data sets with fewer than 1,000 observations, so the two smooth lines need not match exactly.

### Correlation matrix heatmap

In [None]:
corr_matrix = senic.corr(numeric_only=True)

plt.figure(figsize=(12, 9))

sns.heatmap(corr_matrix)

Drop the ID and categorical columns.

In [None]:
cols = [col for col in senic.columns if col not in ('hospital_ID', 'region_num')]

corr_matrix = senic[cols].corr(numeric_only=True)

plt.figure(figsize=(12, 9))

sns.heatmap(corr_matrix)

Change the color palette.

In [None]:
cmap = sns.color_palette("colorblind")
                  
corr_matrix = senic[cols].corr(numeric_only=True)

plt.figure(figsize=(12, 9))

sns.heatmap(corr_matrix, cmap=cmap)

Note: The 'colorblind' color scheme is used here for illustrative purposes only. It's not recommended for a correlation matrix heatmap, because it is a qualitative palette and doesn't convey the correlation spectrum (low to high contrast).

Define a custom color palette.

In [None]:
sns.palplot(sns.diverging_palette(10, 220, n=10))

Use a custom color palette.

In [None]:
cmap = sns.diverging_palette(10, 220, n=10)

corr_matrix = senic[cols].corr(numeric_only=True)

plt.figure(figsize=(12, 9))

sns.heatmap(corr_matrix, cmap=cmap)