<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Manipulating-a-Dataframe" data-toc-modified-id="Manipulating-a-Dataframe-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Manipulating a Dataframe</a></span></li><li><span><a href="#Column-Operations" data-toc-modified-id="Column-Operations-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Column Operations</a></span><ul class="toc-item"><li><span><a href="#File-Read-direct-from-Web-Request" data-toc-modified-id="File-Read-direct-from-Web-Request-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>File Read direct from Web Request</a></span></li><li><span><a href="#Column-Encoding" data-toc-modified-id="Column-Encoding-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Column Encoding</a></span></li><li><span><a href="#Column-Translation" data-toc-modified-id="Column-Translation-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Column Translation</a></span></li><li><span><a href="#Drop-Column" data-toc-modified-id="Drop-Column-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Drop Column</a></span></li></ul></li><li><span><a href="#Data-Types" data-toc-modified-id="Data-Types-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Types</a></span><ul class="toc-item"><li><span><a href="#Changing-Data-Types" data-toc-modified-id="Changing-Data-Types-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Changing Data Types</a></span></li></ul></li><li><span><a href="#Applying-Functions" data-toc-modified-id="Applying-Functions-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Applying Functions</a></span></li><li><span><a href="#Grouping" data-toc-modified-id="Grouping-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Grouping</a></span><ul class="toc-item"><li><span><a href="#Group-by-Multiple-Fields" data-toc-modified-id="Group-by-Multiple-Fields-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Group by Multiple Fields</a></span></li></ul></li></ul></div>

# 5.2 Pandas Operations

## Manipulating a Dataframe
- Common ETL task
- Necessary because real-world data is rarely clean
- Common base processing
   - Calculate column values
   - Translate columns
   - Remove columns

## Column Operations

In [15]:
import io
import pandas as pd
import requests as r

#variables needed for ease of file access
url = 'http://drd.ba.ttu.edu/isqs6339/ex/L2.1/'
file_1 = 'test_data.csv'
file_2 = 'test_data_bad.csv'

### File Read direct from Web Request 

In [3]:
res = r.get(url + file_1)
res.status_code
df = pd.read_csv(io.StringIO(res.text))  
df

Unnamed: 0,RecordId,Continuous,Nominal,Ordinal_7pt,Ordinal_5pt
0,1,41.805702,Green,7,5
1,2,77.210218,Green,3,2
2,3,23.171868,Blue,7,1
3,4,14.442841,Violet,4,5
4,5,3.494745,Blue,5,1
5,6,76.314918,Yellow,1,5
6,7,35.584444,Blue,3,1
7,8,73.576334,Green,7,5
8,9,84.563229,Orange,3,5
9,10,92.856276,Indigo,5,4
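
The read step above can be tried without network access. The sketch below assumes a hypothetical `csv_text` string standing in for `res.text`, with three rows shaped like `test_data.csv`; it only illustrates how `io.StringIO` lets `pd.read_csv` parse text that is already in memory.

```python
import io
import pandas as pd

# Hypothetical stand-in for res.text: a few rows shaped like test_data.csv.
csv_text = """RecordId,Continuous,Nominal,Ordinal_7pt,Ordinal_5pt
1,41.805702,Green,7,5
2,77.210218,Green,3,2
3,23.171868,Blue,7,1
"""

# io.StringIO wraps the string in a file-like object that read_csv accepts.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
```

With a real request, calling `res.raise_for_status()` before parsing is one way to stop early on a failed download instead of only displaying `res.status_code`.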


### Column Encoding

Add a column with all values equal to 100:
- Pandas creates a column when you assign to a column name that does not exist yet

In [4]:
df['100_col'] = 100

Reference this new column to create another column
- Notice new columns

In [6]:
df['likert_100'] = 100/df['Ordinal_7pt']
df

Unnamed: 0,RecordId,Continuous,Nominal,Ordinal_7pt,Ordinal_5pt,100_col,likert_100
0,1,41.805702,Green,7,5,100,14.285714
1,2,77.210218,Green,3,2,100,33.333333
2,3,23.171868,Blue,7,1,100,14.285714
3,4,14.442841,Violet,4,5,100,25.0
4,5,3.494745,Blue,5,1,100,20.0
5,6,76.314918,Yellow,1,5,100,100.0
6,7,35.584444,Blue,3,1,100,33.333333
7,8,73.576334,Green,7,5,100,14.285714
8,9,84.563229,Orange,3,5,100,33.333333
9,10,92.856276,Indigo,5,4,100,20.0


How not to iterate in Pandas
- `df.iterrows` should not be used to update columns
- Each `row` it yields is typically a copy, so assigning to it does not change `df`

In [7]:
# Intentionally broken: the value assigned to `row` never reaches df
for index, row in df.iterrows():
    row['iterval'] = row['Ordinal_7pt'] + row['Ordinal_5pt']
df

Unnamed: 0,RecordId,Continuous,Nominal,Ordinal_7pt,Ordinal_5pt,100_col,likert_100
0,1,41.805702,Green,7,5,100,14.285714
1,2,77.210218,Green,3,2,100,33.333333
2,3,23.171868,Blue,7,1,100,14.285714
3,4,14.442841,Violet,4,5,100,25.0
4,5,3.494745,Blue,5,1,100,20.0
5,6,76.314918,Yellow,1,5,100,100.0
6,7,35.584444,Blue,3,1,100,33.333333
7,8,73.576334,Green,7,5,100,14.285714
8,9,84.563229,Orange,3,5,100,33.333333
9,10,92.856276,Indigo,5,4,100,20.0
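
A minimal sketch of the difference, using a small hypothetical frame `demo` rather than the downloaded data: the loop leaves the frame unchanged, while adding the two columns directly creates the new column in one step.

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the notebook data.
demo = pd.DataFrame({"Ordinal_7pt": [7, 3, 7], "Ordinal_5pt": [5, 2, 1]})

# Each `row` is a separate Series, so adding a key to it never reaches `demo`.
for index, row in demo.iterrows():
    row["iterval"] = row["Ordinal_7pt"] + row["Ordinal_5pt"]
added_by_loop = "iterval" in demo.columns
print(added_by_loop)  # False

# Vectorized alternative: add the two columns element by element.
demo["iterval"] = demo["Ordinal_7pt"] + demo["Ordinal_5pt"]
print(demo["iterval"].tolist())  # [12, 5, 8]
```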


### Column Translation 

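Dumping data into 'buckets'
- Here we mark everything above 4 as high, the remainder as low
- This chained-indexing method isn't best practice: it is a little brute force-y, it triggers the warning shown below the cell, and it may not modify `df` at all in newer pandas versions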

In [9]:
# set default value
df['7highlow'] = 'low'

# set the other value
df['7highlow'][df['Ordinal_7pt'] > 4] = 'high'

df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,RecordId,Continuous,Nominal,Ordinal_7pt,Ordinal_5pt,100_col,likert_100,7highlow
0,1,41.805702,Green,7,5,100,14.285714,high
1,2,77.210218,Green,3,2,100,33.333333,low
2,3,23.171868,Blue,7,1,100,14.285714,high
3,4,14.442841,Violet,4,5,100,25.0,low
4,5,3.494745,Blue,5,1,100,20.0,high
5,6,76.314918,Yellow,1,5,100,100.0,low
6,7,35.584444,Blue,3,1,100,33.333333,low
7,8,73.576334,Green,7,5,100,14.285714,high
8,9,84.563229,Orange,3,5,100,33.333333,low
9,10,92.856276,Indigo,5,4,100,20.0,high
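
One way to do the same bucketing without the warning is a single `.loc` assignment that names the row mask and the column together. This is a sketch on a hypothetical frame `demo`, not the notebook's own cell.

```python
import pandas as pd

# Hypothetical frame; only the column being bucketed is needed.
demo = pd.DataFrame({"Ordinal_7pt": [7, 3, 4, 5, 1]})

# Set the default value for every row.
demo["7highlow"] = "low"

# One .loc call selects rows (the boolean mask) and the column at once,
# so the assignment is made on `demo` itself rather than on an intermediate.
demo.loc[demo["Ordinal_7pt"] > 4, "7highlow"] = "high"

print(demo["7highlow"].tolist())  # ['high', 'low', 'low', 'high', 'low']
```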


### Drop Column
- `df.columns` shows us current columns
- `df.drop` drops columns
    - `axis` parameter specifies whether a row or a column is dropped
- Axes in Pandas
    - Axis = 1 specifies column
    - Axis = 0 specifies row
- Note that `df.drop` returns a new dataframe with the column dropped, which we stored as 'df1'; the original `df` is unchanged
    - To modify `df` itself instead, we can add parameter `inplace=True`

In [10]:
df.columns
df1 = df.drop('7highlow', axis=1) #axis specifies column
df1
# Note, this created a new copy of dataframe: 'df1'

# Instead of copying we can do the following:
df
df.drop('7highlow', axis=1, inplace=True)
df

Unnamed: 0,RecordId,Continuous,Nominal,Ordinal_7pt,Ordinal_5pt,100_col,likert_100
0,1,41.805702,Green,7,5,100,14.285714
1,2,77.210218,Green,3,2,100,33.333333
2,3,23.171868,Blue,7,1,100,14.285714
3,4,14.442841,Violet,4,5,100,25.0
4,5,3.494745,Blue,5,1,100,20.0
5,6,76.314918,Yellow,1,5,100,100.0
6,7,35.584444,Blue,3,1,100,33.333333
7,8,73.576334,Green,7,5,100,14.285714
8,9,84.563229,Orange,3,5,100,33.333333
9,10,92.856276,Indigo,5,4,100,20.0
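
The axis behaviour can be checked on a tiny hypothetical frame. The sketch below also uses the `columns=` keyword of `df.drop`, which is an alternative spelling of `axis=1`.

```python
import pandas as pd

# Hypothetical frame with three columns and two rows.
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

no_b = demo.drop("b", axis=1)      # axis=1: "b" is a column label
no_b_kw = demo.drop(columns="b")   # same result, without the axis number
no_row0 = demo.drop(0, axis=0)     # axis=0: 0 is a row (index) label

print(list(no_b.columns))   # ['a', 'c']
print(list(no_row0.index))  # [1]
print(list(demo.columns))   # ['a', 'b', 'c'], the original is untouched
```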


## Data Types
- Data types determine what actions can be done
    - E.g.: concatenation vs arithmetic
- Pandas tries to guess your data type
    - Also differs from Python data types
    ![5.2-Pandas_Python_Datatypes.png](attachment:5.2-Pandas_Python_Datatypes.png)
    - 'category' type is for categorical data and not present in Python
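
The concatenation-versus-arithmetic point can be seen with two short hypothetical Series: the same `+` operator joins text but adds numbers.

```python
import pandas as pd

# The same digits stored as text and as integers.
as_text = pd.Series(["7", "3"]) + pd.Series(["5", "2"])
as_numbers = pd.Series([7, 3]) + pd.Series([5, 2])

print(as_text.tolist())     # ['75', '32'], strings are concatenated
print(as_numbers.tolist())  # [12, 5], integers are added
```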

Setup for "bad data" example
- Create variables
- Read from web request

In [16]:
res = r.get(url + file_2)
res.status_code
df = pd.read_csv(io.StringIO(res.text))  
df

Unnamed: 0,RecordId,Continuous,Nominal,Ordinal_7pt,Ordinal_5pt
0,1,41.805702,Green,7,a
1,2,77.210218,Green,3,2
2,3,23.171868,Blue,7,1
3,4,14.442841,Violet,4,5
4,5,3.494745,Blue,5,1
5,6,76.314918,Yellow,1,5
6,7,35.584444,Blue,3,1
7,8,73.576334,Green,7,5
8,9,84.563229,Orange,3,5
9,10,92.856276,Indigo,5,4


Notice that `df.describe()` leaves out some fields, and `df.dtypes` shows why: the ordinal columns were read as `object`:

In [17]:
df.describe()
df.dtypes

RecordId         int64
Continuous     float64
Nominal         object
Ordinal_7pt     object
Ordinal_5pt     object
dtype: object

Examine the bad read:

In [19]:
res = r.get(url + file_2)
res.status_code
df = pd.read_csv(io.StringIO(res.text))  

Notice numbers are coming up as objects (strings)
- Pandas will treat an entire column as object datatype if it runs into even one non-numeric value

In [None]:
df.dtypes

Get counts of values in the columns with object datatypes
- We can see one letter appearing in each of these columns

In [20]:
df['Ordinal_7pt'].value_counts()
df['Ordinal_5pt'].value_counts()

1    15
4    11
5    10
2     9
3     4
a     1
Name: Ordinal_5pt, dtype: int64
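
`value_counts` shows that a stray letter exists, but not where it is. As a sketch on a hypothetical column, `pd.to_numeric` with `errors="coerce"` turns unparseable entries into `NaN`, which can then be used as a mask to list the offending rows.

```python
import pandas as pd

# Hypothetical column with one bad entry, mimicking test_data_bad.csv.
demo = pd.DataFrame({"Ordinal_5pt": ["a", "2", "1", "5"]})

# errors="coerce" replaces anything that is not a number with NaN.
converted = pd.to_numeric(demo["Ordinal_5pt"], errors="coerce")

# Rows where the conversion failed are the rows holding bad data.
bad_rows = demo[converted.isna()]
print(bad_rows.index.tolist())  # [0]
```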

For now, we will just drop the rows
- Not the ideal way to clean data, but just an example
- Find the indexes of the bad data rows and drop them
- We are deleting rows here, so axis = 0

In [22]:
df.drop(0, axis=0, inplace=True)
df.drop(49, axis=0, inplace=True)
df

Unnamed: 0,RecordId,Continuous,Nominal,Ordinal_7pt,Ordinal_5pt
1,2,77.210218,Green,3,2
2,3,23.171868,Blue,7,1
3,4,14.442841,Violet,4,5
4,5,3.494745,Blue,5,1
5,6,76.314918,Yellow,1,5
6,7,35.584444,Blue,3,1
7,8,73.576334,Green,7,5
8,9,84.563229,Orange,3,5
9,10,92.856276,Indigo,5,4
10,11,67.363326,Violet,4,1


How did we drop the last row after removing the first?
- You might expect the index to shift after the first drop, so that label 49 becomes 48
- Pandas does not renumber index labels when you remove rows
- If we remove index 7, the index skips from 6 to 8
- You may want to reset the index after drop operations
- `df.reset_index(drop=True, inplace=True)`
    - Resets the index to 0, 1, 2, ...
    - `drop=True` discards the old index instead of keeping it as a column

In [23]:
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,RecordId,Continuous,Nominal,Ordinal_7pt,Ordinal_5pt
0,2,77.210218,Green,3,2
1,3,23.171868,Blue,7,1
2,4,14.442841,Violet,4,5
3,5,3.494745,Blue,5,1
4,6,76.314918,Yellow,1,5
5,7,35.584444,Blue,3,1
6,8,73.576334,Green,7,5
7,9,84.563229,Orange,3,5
8,10,92.856276,Indigo,5,4
9,11,67.363326,Violet,4,1
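
A short sketch of the same behaviour on a hypothetical four-row frame: the labels keep their gaps after a drop, and `reset_index(drop=True)` renumbers them.

```python
import pandas as pd

# Hypothetical frame with the default index 0..3.
demo = pd.DataFrame({"x": [10, 20, 30, 40]})

# Dropping labels 0 and 2 leaves gaps; the remaining labels are not renumbered.
dropped = demo.drop([0, 2], axis=0)
print(dropped.index.tolist())  # [1, 3]

# reset_index(drop=True) returns a frame with a fresh 0..n-1 index.
renumbered = dropped.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
print(renumbered["x"].tolist())   # [20, 40]
```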


### Changing Data Types

In [None]:
df.dtypes

The values are now correct, but the datatype is still wrong
- `pd.to_numeric(df['variable'])`
    - Converts the values to a numeric datatype
    - Pass in the necessary columns from the df and assign the result back

In [24]:
df['Ordinal_7pt'] = pd.to_numeric(df['Ordinal_7pt'])
df['Ordinal_5pt'] = pd.to_numeric(df['Ordinal_5pt'])

Check types and describe!

In [25]:
#Types look good
df.dtypes
df.describe()

Unnamed: 0,RecordId,Continuous,Ordinal_7pt,Ordinal_5pt
count,48.0,48.0,48.0,48.0
mean,25.5,52.093181,4.375,2.8125
std,14.0,28.198455,1.770022,1.579877
min,2.0,3.37045,1.0,1.0
25%,13.75,30.449949,3.0,1.0
50%,25.5,56.968931,4.0,2.5
75%,37.25,76.538743,6.0,4.0
max,49.0,100.75333,7.0,5.0
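
To confirm a conversion programmatically rather than by reading `df.dtypes`, one option is `pd.api.types.is_numeric_dtype`. The frame below is a hypothetical stand-in for the cleaned data.

```python
import pandas as pd

# Hypothetical column of digits that were read in as text.
demo = pd.DataFrame({"Ordinal_7pt": ["3", "7", "5"]})

# Convert and assign back, as in the cell above.
demo["Ordinal_7pt"] = pd.to_numeric(demo["Ordinal_7pt"])

# Numeric columns support arithmetic summaries such as mean().
is_numeric = pd.api.types.is_numeric_dtype(demo["Ordinal_7pt"])
mean_value = demo["Ordinal_7pt"].mean()
print(is_numeric, mean_value)  # True 5.0
```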


## Applying Functions
- Pandas supports an `apply` function
    - Applies a function to every row or every column along an axis
- The same `apply` call works for rows and columns; only the `axis` argument changes

Setup:

In [26]:
url = 'http://drd.ba.ttu.edu/isqs6339/ex/L2.1/'
file_1 = 'test_data.csv'

res = r.get(url + file_1)
res.status_code
df = pd.read_csv(io.StringIO(res.text)) 

Recall from earlier that this doesn't work:

In [None]:
#Intentionally broken
for index, row in df.iterrows():
    row['iterval'] = row['Ordinal_7pt'] + row['Ordinal_5pt'] + index

Define a function to sum two columns in a row

In [27]:
def ComputeVals(row):
    return row['Ordinal_7pt'] + row['Ordinal_5pt']

`df.apply(functionname, axis)`
- Use `apply` with axis 1 so that our function receives one row at a time

In [28]:
df['encoded_likert'] = df.apply(ComputeVals, axis=1)
df

Unnamed: 0,RecordId,Continuous,Nominal,Ordinal_7pt,Ordinal_5pt,encoded_likert
0,1,41.805702,Green,7,5,12
1,2,77.210218,Green,3,2,5
2,3,23.171868,Blue,7,1,8
3,4,14.442841,Violet,4,5,9
4,5,3.494745,Blue,5,1,6
5,6,76.314918,Yellow,1,5,6
6,7,35.584444,Blue,3,1,4
7,8,73.576334,Green,7,5,12
8,9,84.563229,Orange,3,5,8
9,10,92.856276,Indigo,5,4,9
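
The two axis settings can be compared side by side. This sketch uses a hypothetical frame and a renamed helper, `compute_vals`, that mirrors `ComputeVals` above.

```python
import pandas as pd

def compute_vals(row):
    # Receives one row as a Series when apply is called with axis=1.
    return row["Ordinal_7pt"] + row["Ordinal_5pt"]

# Hypothetical three-row frame with the notebook's column names.
demo = pd.DataFrame({"Ordinal_7pt": [7, 3, 7], "Ordinal_5pt": [5, 2, 1]})

# axis=1: the function is called once per row.
row_wise = demo.apply(compute_vals, axis=1)
print(row_wise.tolist())  # [12, 5, 8]

# axis=0: the function is called once per column.
col_max = demo.apply(lambda col: col.max(), axis=0)
print(col_max.tolist())  # [7, 5]
```

For a plain sum like this one, `demo["Ordinal_7pt"] + demo["Ordinal_5pt"]` gives the same values without calling a Python function per row; `apply` is mainly useful when the row logic cannot be written as column arithmetic.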


## Grouping

`df.groupby('variable')`
- Use to group data
- Here we group values in the nominal column and return the means

In [29]:
df.groupby('Nominal').mean()

Unnamed: 0_level_0,RecordId,Continuous,Ordinal_7pt,Ordinal_5pt,encoded_likert
Nominal,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Blue,21.5,41.470501,4.083333,2.0,6.083333
Green,25.7,58.984495,5.3,3.0,8.3
Indigo,32.75,61.158636,5.125,3.5,8.625
Orange,23.6,50.995312,4.6,2.4,7.0
Red,27.285714,63.44162,3.857143,2.571429,6.428571
Violet,23.8,27.896101,4.2,4.0,8.2
Yellow,23.333333,65.415074,2.333333,4.0,6.333333
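
A small sketch of grouped means on hypothetical data. The extra text column `Label` is an assumption added here to show `numeric_only=True`, which keeps only numeric columns in the result.

```python
import pandas as pd

# Hypothetical data: one grouping column, one text column, two numeric columns.
demo = pd.DataFrame({
    "Nominal": ["Blue", "Green", "Blue", "Green"],
    "Label": ["a", "b", "c", "d"],
    "Continuous": [10.0, 20.0, 30.0, 40.0],
    "Ordinal_7pt": [3, 7, 5, 1],
})

# Mean of every numeric column per group; the text column is left out.
means = demo.groupby("Nominal").mean(numeric_only=True)
print(means.loc["Blue", "Continuous"])  # 20.0

# Several statistics for a single column.
summary = demo.groupby("Nominal")["Continuous"].agg(["mean", "count"])
print(summary.loc["Green", "mean"])  # 30.0
```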


### Group by Multiple Fields
- Create our high/low column from the Column Translation section
- Pass both column names into `df.groupby` as a list (use square brackets) and get the means

In [30]:
df['7highlow'] = 'low'
#set the other value
df['7highlow'][df['Ordinal_7pt'] > 4] = 'high'

df.groupby(['Nominal', '7highlow']).mean()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0_level_0,Unnamed: 1_level_0,RecordId,Continuous,Ordinal_7pt,Ordinal_5pt,encoded_likert
Nominal,7highlow,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Blue,high,10.6,27.428431,5.6,2.0,7.6
Blue,low,29.285714,51.500551,3.0,2.0,5.0
Green,high,27.142857,63.668165,6.571429,3.285714,9.857143
Green,low,22.333333,48.05593,2.333333,2.333333,4.666667
Indigo,high,29.8,55.532045,6.0,4.0,10.0
Indigo,low,37.666667,70.536287,3.666667,2.666667,6.333333
Orange,high,31.0,54.991894,5.666667,2.0,7.666667
Orange,low,12.5,45.000441,3.0,3.0,6.0
Red,high,34.666667,40.613895,5.666667,2.333333,8.0
Red,low,21.75,80.562413,2.5,2.75,5.25
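
Grouping by a list of columns produces one row per combination that actually occurs, indexed by both keys. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data with two grouping columns and one numeric column.
demo = pd.DataFrame({
    "Nominal": ["Blue", "Blue", "Green", "Green"],
    "7highlow": ["high", "low", "high", "high"],
    "Continuous": [10.0, 30.0, 20.0, 40.0],
})

# A list of column names groups by every observed combination of their values.
means = demo.groupby(["Nominal", "7highlow"]).mean()
print(len(means))                                    # 3
print(means.loc[("Green", "high"), "Continuous"])    # 30.0

# reset_index turns the two index levels back into ordinary columns.
flat = means.reset_index()
print(list(flat.columns))  # ['Nominal', '7highlow', 'Continuous']
```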
