# Data Manipulation and Plotting with `pandas`

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

![pandas](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/2880px-Pandas_logo.svg.png)

## Learning Goals

- Load .csv files into `pandas` DataFrames
- Describe and manipulate data in Series and DataFrames
- Visualize data using DataFrame methods and `matplotlib`

## What is Pandas?

Pandas, as [the Anaconda docs](https://docs.anaconda.com/anaconda/packages/py3.7_osx-64/) tell us, offers us "High-performance, easy-to-use data structures and data analysis tools." It's something like "Excel for Python", but it's quite a bit more powerful.

Let's read in the heart dataset.

Pandas has many methods for reading different types of files. Note that here we have a .csv file.

Read about this dataset [here](https://www.kaggle.com/ronitf/heart-disease-uci).

In [8]:
heart_df = pd.read_csv('heart.csv')

The output of `pd.read_csv()` is a pandas *DataFrame*, which has a familiar tabular structure of rows and columns.
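As a minimal sketch of the same idea without a file on disk, `pd.read_csv()` can also read from a file-like object. The three-row sample and the names `csv_text` and `sample_df` are made up for illustration and are not part of the heart dataset.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for a .csv file on disk.
csv_text = """age,sex,chol
63,1,233
37,1,250
41,0,204
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df))  # the DataFrame class
print(sample_df.shape)  # (rows, columns)
```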

In [9]:
type(heart_df)

pandas.core.frame.DataFrame

In [16]:
heart_df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,new col
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,0
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1,3
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0,298
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0,299
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0,300
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0,301


## DataFrames and Series

Two main types of pandas objects are the DataFrame and the Series, the latter being in effect a single column of the former:

In [13]:
age_series = heart_df['age']
type(age_series)

pandas.core.series.Series

In [19]:
age_series = heart_df.age

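Notice how we can isolate a column of our DataFrame simply by using square brackets together with the name of the column. Dot notation (`heart_df.age`) also works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute.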
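Here is a small sketch comparing the two ways of selecting a column. The DataFrame `toy` is a made-up stand-in, not the heart dataset; its `'new col'` column shows a name that only bracket notation can reach.

```python
import pandas as pd

# Hypothetical toy frame; 'new col' contains a space on purpose.
toy = pd.DataFrame({'age': [63, 37, 41], 'new col': [0, 1, 2]})

by_bracket = toy['age']  # bracket notation always works
by_dot = toy.age         # attribute access works for simple names

print(by_bracket.equals(by_dot))

# A column name with a space is only reachable with brackets.
spaced = toy['new col']
print(type(spaced))
```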

Both Series and DataFrames have an *index* as well:

In [20]:
heart_df.index

RangeIndex(start=0, stop=303, step=1)

In [21]:
age_series.index

RangeIndex(start=0, stop=303, step=1)

Pandas is built on top of NumPy, and we can always access the NumPy array underlying a DataFrame using `.values`.

In [22]:
heart_df.values

array([[ 63.,   1.,   3., ...,   1.,   1.,   0.],
       [ 37.,   1.,   2., ...,   2.,   1.,   1.],
       [ 41.,   0.,   1., ...,   2.,   1.,   2.],
       ...,
       [ 68.,   1.,   0., ...,   3.,   0., 300.],
       [ 57.,   1.,   0., ...,   3.,   0., 301.],
       [ 57.,   0.,   1., ...,   2.,   0., 302.]])
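The array above is all floats even though most columns hold integers. A minimal sketch of why, assuming a made-up two-column frame `toy`: when columns have different numeric dtypes, the resulting array takes a single common dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one integer column and one float column.
toy = pd.DataFrame({'age': [63, 37], 'oldpeak': [2.3, 3.5]})

arr = toy.values       # a single NumPy array for the whole frame
same = toy.to_numpy()  # method form that returns the same data

print(toy.dtypes)      # per-column dtypes differ
print(arr.dtype)       # one common dtype for the array
```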

## Basic DataFrame Attributes and Methods

### `.head()`

In [23]:
heart_df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,new col
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,0
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1,3
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1,4


### `.tail()`

In [None]:
heart_df.tail()

### `.info()`

In [None]:
heart_df.info()

### `.describe()`

In [None]:
heart_df.describe()

### `.dtypes`

In [None]:
heart_df.dtypes

### `.shape`

In [None]:
heart_df.shape

### Exploratory Plots

Let's make ourselves a histogram of ages:

In [None]:
sns.set_style('darkgrid')
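# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar plot.
sns.histplot(data=heart_df, x='age', kde=True);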

And while we're at it let's do a scatter plot of maximum heart rate vs. age:

In [None]:
sns.scatterplot(x=heart_df['age'], y=heart_df['thalach']);

## Adding to a DataFrame


### Adding Rows

Here are two rows that our engineer accidentally left out of the .csv file, expressed as a Python dictionary:

In [None]:
extra_rows = {'age': [40, 30], 'sex': [1, 0], 'cp': [0, 0], 'trestbps': [120, 130],
              'chol': [240, 200],
              'fbs': [0, 0], 'restecg': [1, 0], 'thalach': [120, 122], 'exang': [0, 1],
              'oldpeak': [0.1, 1.0], 'slope': [1, 1], 'ca': [0, 1], 'thal': [2, 3],
              'target': [0, 0]}
extra_rows

How can we add this to the bottom of our dataset?

In [None]:
# Let's first turn this into a DataFrame.
# We can pass the dictionary directly to the pd.DataFrame() constructor.

missing = pd.DataFrame(extra_rows)
missing

In [37]:
# Now we just need to concatenate the two DataFrames together.
# Note the `ignore_index` parameter! We'll set that to True.

heart_augmented = pd.concat([heart_df, missing],
                           ignore_index=True)
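To see what `ignore_index` changes, here is a small sketch with two made-up frames, `top` and `bottom`. It compares the index of the concatenated result with and without the parameter.

```python
import pandas as pd

# Hypothetical two-row frames standing in for heart_df and missing.
top = pd.DataFrame({'age': [63, 37], 'chol': [233, 250]})
bottom = pd.DataFrame({'age': [40, 30], 'chol': [240, 200]})

kept = pd.concat([top, bottom])                           # original labels are kept
renumbered = pd.concat([top, bottom], ignore_index=True)  # fresh 0..n-1 index

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```

Without `ignore_index=True` the result carries duplicate index labels, which makes label-based selection with `.loc` ambiguous.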


In [28]:
# Let's check the end to make sure we were successful!

heart_augmented.tail()


### Adding Columns

Adding a column is very easy in `pandas`. Let's add a new column to our dataset called "test", and set all of its values to 0.

In [27]:
heart_augmented['test'] = 0


In [None]:
heart_augmented.head()

I can also add columns whose values are functions of existing columns.

Suppose I want to add the cholesterol column ("chol") to the resting systolic blood pressure column ("trestbps"):

In [26]:
heart_augmented['chol+trestbps'] = heart_augmented['chol'] + heart_augmented['trestbps']
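Both kinds of column assignment can be tried on a tiny frame. This is a sketch with made-up values in a frame called `toy`; the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as the heart data.
toy = pd.DataFrame({'chol': [233, 250], 'trestbps': [145, 130]})

# A scalar is broadcast to every row.
toy['test'] = 0

# Arithmetic between columns is element-wise, row by row.
toy['chol+trestbps'] = toy['chol'] + toy['trestbps']

print(toy)
```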


In [None]:
heart_augmented.head()

## Filtering

We can use filtering techniques to see only certain rows of our data. If we want to see only the rows for patients 70 years of age or older, we can simply type:

In [25]:
heart_augmented[heart_augmented['age'] >= 70]


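Use '&' for "and" and '|' for "or". Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as `>=`.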
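A short sketch of combining conditions, using a made-up four-row frame `toy` and conditions that differ from the exercise that follows. Each comparison produces a boolean Series, and `&` or `|` combine them element-wise.

```python
import pandas as pd

# Hypothetical four-row frame for illustration.
toy = pd.DataFrame({'age': [45, 62, 48, 70],
                    'chol': [260, 300, 200, 180]})

young = toy['age'] < 50        # boolean Series, one value per row
high_chol = toy['chol'] > 240  # boolean Series, one value per row

both = toy[young & high_chol]    # rows where both conditions hold
either = toy[young | high_chol]  # rows where at least one holds

# Written inline, each condition needs its own parentheses.
inline = toy[(toy['age'] < 50) & (toy['chol'] > 240)]

print(both)
print(either)
```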

### Exercise

Display the patients who are 70 or over as well as the patients whose trestbps score is greater than 170.

In [32]:
heart_augmented[(heart_augmented['age'] >= 70) | (heart_augmented['trestbps'] > 170)]

<details>
    <summary>Answer</summary>
    <code>heart_augmented[(heart_augmented['age'] >= 70) | (heart_augmented['trestbps'] > 170)]</code>
    </details>

### Exploratory Plot

Using the subframe we just made, let's make a scatter plot of their cholesterol levels vs. age and color by sex:

In [31]:
at_risk = heart_augmented[(heart_augmented['age'] >= 70) | (heart_augmented['trestbps'] > 170)]

sns.scatterplot(data=at_risk, x='age', y='chol', hue='sex');

### `.loc` and `.iloc`

We can use `.loc` to get, say, the first ten values of the age and resting blood pressure ("trestbps") columns:

In [33]:
heart_augmented.loc


In [34]:
heart_augmented.loc[:9, ['age', 'trestbps']]


`.iloc` is used for selecting locations in the DataFrame **by number**:

In [35]:
heart_augmented.iloc


In [36]:
heart_augmented.iloc[3, 0]
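The difference between labels and positions is easier to see when the index is not the default range. A minimal sketch, assuming a made-up frame `toy` with string labels: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position.

```python
import pandas as pd

# Hypothetical frame with string labels so labels and positions differ.
toy = pd.DataFrame(
    {'age': [63, 37, 41, 56], 'chol': [233, 250, 204, 236]},
    index=['a', 'b', 'c', 'd'],
)

by_label = toy.loc['a':'c', 'age']  # label slice: 'c' is included
by_position = toy.iloc[0:2, 0]      # position slice: position 2 is excluded
single = toy.iloc[3, 0]             # fourth row, first column

print(by_label)
print(by_position)
print(single)
```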


### Exercise

How would we get the same slice as just above by using `.iloc` instead of `.loc`?

In [40]:
heart_augmented.iloc[:10, [0, 3]]

<details>
    <summary>Answer</summary>
    <code>heart_augmented.iloc[:10, [0, 3]]</code>
    </details>

## Statistics

### `.mean()`

In [None]:
heart_augmented.mean()

Be careful! Some of these are not straightforwardly interpretable. What does an average "sex" of 0.682 mean?
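One way to think about that question: for a column coded as 0 and 1, the mean is the share of rows equal to 1. A small sketch with a made-up five-value Series `sex`:

```python
import pandas as pd

# Hypothetical 0/1 column; three of the five values are 1.
sex = pd.Series([1, 1, 0, 1, 0])

mean_value = sex.mean()                 # arithmetic mean of the codes
share_of_ones = (sex == 1).sum() / len(sex)  # fraction of rows equal to 1

print(mean_value, share_of_ones)
```

The mean of a categorical code with more than two levels (such as "cp" or "thal") has no such reading.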

### `.min()`

In [None]:
heart_augmented.min()

### `.max()`

In [None]:
heart_augmented.max()

## Series Methods

### `.value_counts()`

How many different values does slope have? What about sex? And target?

In [None]:
heart_augmented['slope'].value_counts()

### `.sort_values()`

In [None]:
heart_augmented['age'].sort_values()
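Both methods can be checked on short Series. This sketch uses made-up values in `slope` and `ages`; it also shows `.nunique()` and the `ascending` parameter, which the cells above do not use.

```python
import pandas as pd

# Hypothetical values for illustration.
slope = pd.Series([2, 1, 2, 0, 2, 1], name='slope')
ages = pd.Series([63, 37, 41], name='age')

counts = slope.value_counts()  # occurrences per value, most frequent first
n_distinct = slope.nunique()   # number of distinct values

oldest_first = ages.sort_values(ascending=False)  # original index labels are kept

print(counts)
print(n_distinct)
print(oldest_first)
```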

## `pandas`-Native Plotting

The `.plot()` and `.hist()` methods available for DataFrames are wrappers around `matplotlib`:

In [None]:
heart_augmented.plot(x='age', y='trestbps', kind='scatter');

In [None]:
heart_augmented.hist(column='chol');

## Exercises

1. Make a bar plot of "age" vs. "slope" for the `heart_augmented` DataFrame.

<details>
    <summary>Answer</summary>
    <code>sns.barplot(data=heart_augmented, x='slope', y='age');</code>
    </details>

2. Make a histogram of ages for **just the men** in `heart_augmented` (`heart_augmented['sex'] == 1`).

In [41]:
heart_augmented[heart_augmented['sex'] == 1]['age']

In [43]:
sns.histplot(heart_augmented[heart_augmented['sex'] == 1]['age']);

<details>
    <summary>Answer</summary>
<code>men = heart_augmented[heart_augmented['sex'] == 1]
sns.histplot(men['age'], kde=True);</code>
    </details>

3. Make separate scatter plots of cholesterol vs. resting systolic blood pressure for the target=0 and the target=1 groups. Put both plots on the same figure and give each an appropriate title.

In [44]:
target_zero = heart_augmented[heart_augmented['target'] == 0]
target_one = heart_augmented[heart_augmented['target'] == 1]

In [48]:
# One figure with two side-by-side axes, one per target group
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

# Scatter plot 1: chol vs. trestbps for the target=0 group
sns.scatterplot(data=target_zero, x='trestbps', y='chol', ax=ax1)
ax1.set_title('target = 0')

# Scatter plot 2: chol vs. trestbps for the target=1 group
sns.scatterplot(data=target_one, x='trestbps', y='chol', ax=ax2)
ax2.set_title('target = 1');

<details>
    <summary>Answer</summary>
<code>target0 = heart_augmented[heart_augmented['target'] == 0]
target1 = heart_augmented[heart_augmented['target'] == 1]

fig, ax = plt.subplots(1, 2, figsize=(10, 5))

sns.scatterplot(data=target0, x='trestbps', y='chol', ax=ax[0])
sns.scatterplot(data=target1, x='trestbps', y='chol', ax=ax[1])

ax[0].set_title('Cholesterol vs. Resting Blood Pressure (target = 0)')
ax[1].set_title('Cholesterol vs. Resting Blood Pressure (target = 1)');</code>
    </details>

## Let's find a .csv file online and experiment with it.

I'm going to head to [dataportals.org](https://dataportals.org) to find a .csv file.