### <span style="color:black"><b>Pandas Tutorial 3</b></span>

---

<ins>Selecting & renaming columns</ins>

* In this tutorial you'll see how to select, create and rename columns using the pandas library
* Basic knowledge of python [dictionaries](https://www.w3schools.com/python/python_dictionaries.asp) will help you get the most out of this video
* [f-strings](https://realpython.com/python-f-strings/) and [list comprehensions](https://www.programiz.com/python-programming/list-comprehension) are a bonus but aren't essential
* I'll also introduce the controversial `inplace` argument

Useful dataframe methods:
<pre>
df.select_dtypes()
df.rename(columns = ..., inplace = True)
</pre>

In [1]:
import pandas as pd

In [2]:
# Recap of the python dictionary
mydict = {'Name':'Noah', 'Age':19}
mydict['Age']

19

In [4]:
# Read in data
df = pd.read_csv('twitch.csv')

# First five rows
df.head()

Unnamed: 0,Channel,Peak viewers,Followers gained,Views gained,Language,T
0,xQcOW,222720,1734810,93036735,English,449
1,summit1g,310998,1370184,89705964,English,376
2,Gaules,387315,1023779,102611607,Portuguese,778
3,ESL_CSGO,300575,703986,106546942,English,977
4,Tfue,285644,2068424,78998587,English,748


<span style="color:black"><b>Selecting A Pandas Series (Column)</b></span> 


Exercise: Select a column in two different ways, outlining the advantages and disadvantages of each method of doing so

In [5]:
# Method 1: Use dictionary syntax
df['Peak viewers']

0      222720
1      310998
2      387315
3      300575
4      285644
        ...  
995     21359
996      3940
997      6431
998     10543
999     13788
Name: Peak viewers, Length: 1000, dtype: int64

In [6]:
# Method 2: Use dot syntax
df.Channel

0                 xQcOW
1              summit1g
2                Gaules
3              ESL_CSGO
4                  Tfue
             ...       
995           LITkillah
996    빅헤드 (bighead033)
997      마스카 (newmasca)
998       AndyMilonakis
999                Remx
Name: Channel, Length: 1000, dtype: object

In [11]:
# Example 1 of where dot syntax fails
# We want the column named 'T', but df.T gives us the transposed dataframe instead
df.T

# Reason: the column name clashes with the pandas .T attribute, which does a matrix transpose

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
Channel,xQcOW,summit1g,Gaules,ESL_CSGO,Tfue,Asmongold,NICKMERCS,Fextralife,loltyler1,Anomaly,...,KEEMSTAR,크캣66 (crazzyccat),RelaxBeats,LAGTVMaximusBlack,Destructoid,LITkillah,빅헤드 (bighead033),마스카 (newmasca),AndyMilonakis,Remx
Peak viewers,222720,310998,387315,300575,285644,263720,115633,68795,89387,125408,...,74195,2543,2830,7138,14566,21359,3940,6431,10543,13788
Followers gained,1734810,1370184,1023779,703986,2068424,554201,1089824,425468,951730,1532689,...,46367,24849,29595,13251,8995,562691,52289,-4942,109111,59432
Views gained,93036735,89705964,102611607,106546942,78998587,61715781,46084211,670137548,51349926,36350662,...,7139253,1889696,1094850,2310313,87603521,2162107,4399897,3417970,3926918,2049420
Language,English,English,Portuguese,English,English,English,English,English,English,English,...,English,Korean,English,English,English,Spanish,Korean,Korean,English,French
T,449,376,778,977,748,465,970,621,762,424,...,30,27,549,382,257,829,424,656,691,967


In [13]:
# Example 2 of where dot syntax fails
# df.Peak viewers   (uncommenting this line raises a SyntaxError)

# Reason: The name of the series has whitespace between the words
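Both failure cases can be reproduced on a tiny dataframe. This is a minimal sketch: `sample` is a hand-built stand-in that copies the first two rows shown above (it is not read from `twitch.csv`), and the variable names are my own.

```python
import pandas as pd

# Small stand-in for the twitch data, copied from the first two rows above
sample = pd.DataFrame({
    'Channel': ['xQcOW', 'summit1g'],
    'Peak viewers': [222720, 310998],
    'T': [449, 376],
})

# Dot syntax resolves to the DataFrame.T attribute (the transpose), not the column
transposed = sample.T
print(transposed.shape)  # rows and columns have swapped places

# Dictionary syntax always means "the column with this name"
t_column = sample['T']
peak_viewers = sample['Peak viewers']  # names with spaces work here too
print(t_column.tolist())
```

Dictionary syntax is a little more typing, but it works for every column name, which is why it is the safer default.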

 <span style="color:black"><b>Selecting Multiple Columns</b></span>

To select multiple columns, pass the names of the columns as strings inside a python list.

Exercise: Select just the 'Channel', 'Followers gained' and 'Language' columns

In [16]:
df[['Channel', 'Followers gained', 'Language']]

Unnamed: 0,Channel,Followers gained,Language
0,xQcOW,1734810,English
1,summit1g,1370184,English
2,Gaules,1023779,Portuguese
3,ESL_CSGO,703986,English
4,Tfue,2068424,English
...,...,...,...
995,LITkillah,562691,Spanish
996,빅헤드 (bighead033),52289,Korean
997,마스카 (newmasca),-4942,Korean
998,AndyMilonakis,109111,English
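The extra pair of square brackets matters: a single string returns a Series, while a list of strings returns a DataFrame, even when the list has only one name in it. A minimal sketch, again using a hand-built `sample` frame copied from the rows above rather than the full file:

```python
import pandas as pd

# Small stand-in for the twitch data, copied from the first two rows above
sample = pd.DataFrame({
    'Channel': ['xQcOW', 'summit1g'],
    'Followers gained': [1734810, 1370184],
    'Language': ['English', 'English'],
})

as_series = sample['Channel']    # single string -> pandas Series
as_frame = sample[['Channel']]   # list with one name -> one-column DataFrame

# The columns come back in the order given in the list
subset = sample[['Language', 'Channel']]

print(type(as_series).__name__, type(as_frame).__name__)
print(list(subset.columns))
```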


<span style="color:black"><b>Selecting Only Numeric or String Columns</b></span> 

* For this we use the `select_dtypes()` dataframe method, which lets us pick out just the numeric columns or just the text columns
* Pass in a dtype for the `include` argument. A list with multiple dtypes is allowed too
* Documentation for `select_dtypes()` can be found [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html) 

In [18]:
# Numeric only
df.select_dtypes('number')

Unnamed: 0,Peak viewers,Followers gained,Views gained,T
0,222720,1734810,93036735,449
1,310998,1370184,89705964,376
2,387315,1023779,102611607,778
3,300575,703986,106546942,977
4,285644,2068424,78998587,748
...,...,...,...,...
995,21359,562691,2162107,829
996,3940,52289,4399897,424
997,6431,-4942,3417970,656
998,10543,109111,3926918,691


In [19]:
# Text/object columns only 
df.select_dtypes('object')

Unnamed: 0,Channel,Language
0,xQcOW,English
1,summit1g,English
2,Gaules,Portuguese
3,ESL_CSGO,English
4,Tfue,English
...,...,...
995,LITkillah,Spanish
996,빅헤드 (bighead033),Korean
997,마스카 (newmasca),Korean
998,AndyMilonakis,English
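The bullet points above mention that `include` also accepts a list of dtypes. Here is a minimal sketch of that, plus the matching `exclude` argument. The `sample` frame is a hand-built stand-in based on the first two rows of the data, with a True/False column like the one created later in this tutorial.

```python
import pandas as pd

# Small stand-in with one text, one numeric and one True/False column
sample = pd.DataFrame({
    'Channel': ['xQcOW', 'summit1g'],
    'Peak viewers': [222720, 310998],
    'followers gained bigger than views gained???': [False, False],
})

# A list for `include` keeps every column that matches any of the dtypes
numbers_and_bools = sample.select_dtypes(include=['number', 'bool'])

# `exclude` works the other way round: drop the matching columns, keep the rest
not_numeric = sample.select_dtypes(exclude='number')

print(list(numbers_and_bools.columns))
print(list(not_numeric.columns))
```

Note that True/False columns do not count as `'number'`, so they have to be asked for separately.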


<span style="color:black"><b>Creating New Columns</b></span> 

* Creating new columns is neither tedious nor difficult when working with pandas dataframes
* In fact, when creating a new column, we can use our dictionary syntax (dot syntax is not allowed here) as if the column were already in our dataset to start off with

Exercise: Create a new column named 'Testing' that has the value of 99.9 (just to see how this can be done)

In [21]:
# Solution:
df['Testing'] = 99.9

In [22]:
# First three rows
df.head(3)

Unnamed: 0,Channel,Peak viewers,Followers gained,Views gained,Language,T,Testing
0,xQcOW,222720,1734810,93036735,English,449,99.9
1,summit1g,310998,1370184,89705964,English,376,99.9
2,Gaules,387315,1023779,102611607,Portuguese,778,99.9
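To see why dot syntax is not allowed when creating a column, here is a minimal sketch on a hand-built `sample` frame (my own stand-in, not the full dataset). Assigning with a dot just sets an ordinary Python attribute on the object, so no column appears.

```python
import pandas as pd

# Small stand-in for the twitch data
sample = pd.DataFrame({'Channel': ['xQcOW', 'summit1g']})

# Dot assignment to a new name sets a plain attribute; no column is added
sample.Testing = 99.9
created_with_dot = 'Testing' in sample.columns
print(created_with_dot)

# Dictionary syntax creates the column and fills every row with the value
sample['Testing'] = 99.9
created_with_brackets = 'Testing' in sample.columns
print(created_with_brackets)
```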


Exercise: Create a new pandas series called 'followers/views ratio' that divides Followers gained by Views gained

In [24]:
# Solution:
df['followers/views ratio'] = df['Followers gained'] / df['Views gained']

In [26]:
# First three rows to see if things worked
df.head(3)

Unnamed: 0,Channel,Peak viewers,Followers gained,Views gained,Language,T,Testing,followers/views ratio
0,xQcOW,222720,1734810,93036735,English,449,99.9,0.018647
1,summit1g,310998,1370184,89705964,English,376,99.9,0.015274
2,Gaules,387315,1023779,102611607,Portuguese,778,99.9,0.009977


Exercise: Create a new pandas series called 'followers gained bigger than views gained???' that is True when Followers gained is greater than Views gained, otherwise False

In [28]:
# Solution
df['followers gained bigger than views gained???'] = df['Followers gained'] > df['Views gained']

In [29]:
# See if the changes worked
df.head()

Unnamed: 0,Channel,Peak viewers,Followers gained,Views gained,Language,T,Testing,followers/views ratio,followers gained bigger than views gained???
0,xQcOW,222720,1734810,93036735,English,449,99.9,0.018647,False
1,summit1g,310998,1370184,89705964,English,376,99.9,0.015274,False
2,Gaules,387315,1023779,102611607,Portuguese,778,99.9,0.009977,False
3,ESL_CSGO,300575,703986,106546942,English,977,99.9,0.006607,False
4,Tfue,285644,2068424,78998587,English,748,99.9,0.026183,False


Extension exercise: Create a new pandas series called 'description' that gives a summary of the channel

* The first row should say 'xQcOW is a(n) English twitch channel'
* The second row should say 'summit1g is a(n) English twitch channel'
* The third row should say 'Gaules is a(n) Portuguese twitch channel'

Prerequisites:
* Knowledge about list comprehensions and `zip()`

OR

* Knowledge of the pandas `df.apply()` method

In [32]:
# Solution
df['description'] = [f"{channel} is a(n) {language} twitch channel" for channel, language in zip(df['Channel'], df['Language'])]
df

Unnamed: 0,Channel,Peak viewers,Followers gained,Views gained,Language,T,Testing,followers/views ratio,followers gained bigger than views gained???,description
0,xQcOW,222720,1734810,93036735,English,449,99.9,0.018647,False,xQcOW is a(n) English twitch channel
1,summit1g,310998,1370184,89705964,English,376,99.9,0.015274,False,summit1g is a(n) English twitch channel
2,Gaules,387315,1023779,102611607,Portuguese,778,99.9,0.009977,False,Gaules is a(n) Portuguese twitch channel
3,ESL_CSGO,300575,703986,106546942,English,977,99.9,0.006607,False,ESL_CSGO is a(n) English twitch channel
4,Tfue,285644,2068424,78998587,English,748,99.9,0.026183,False,Tfue is a(n) English twitch channel
...,...,...,...,...,...,...,...,...,...,...
995,LITkillah,21359,562691,2162107,Spanish,829,99.9,0.260251,False,LITkillah is a(n) Spanish twitch channel
996,빅헤드 (bighead033),3940,52289,4399897,Korean,424,99.9,0.011884,False,빅헤드 (bighead033) is a(n) Korean twitch channel
997,마스카 (newmasca),6431,-4942,3417970,Korean,656,99.9,-0.001446,False,마스카 (newmasca) is a(n) Korean twitch channel
998,AndyMilonakis,10543,109111,3926918,English,691,99.9,0.027785,False,AndyMilonakis is a(n) English twitch channel
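The solution above takes the list comprehension route. For the `df.apply()` route mentioned in the prerequisites, a minimal sketch could look like this. The `sample` frame is a hand-built stand-in copied from the first three rows, and the second column name `description_concat` is only there for comparison.

```python
import pandas as pd

# Small stand-in for the twitch data, copied from the first three rows above
sample = pd.DataFrame({
    'Channel': ['xQcOW', 'summit1g', 'Gaules'],
    'Language': ['English', 'English', 'Portuguese'],
})

# Option A: apply() with axis=1 calls the function once per row
sample['description'] = sample.apply(
    lambda row: f"{row['Channel']} is a(n) {row['Language']} twitch channel",
    axis=1,
)

# Option B: plain string concatenation on whole columns
sample['description_concat'] = (
    sample['Channel'] + ' is a(n) ' + sample['Language'] + ' twitch channel'
)

print(sample['description'].tolist())
```

All three approaches give the same text. On large dataframes `apply()` with `axis=1` is usually the slowest of them, so treat it as the flexible option rather than the fast one.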


<span style="color:black"><b>Renaming Columns</b></span> 

A very common way to rename columns is to use the `df.rename()` command from pandas

Exercise: Rename the pandas series 'Channel' to 'Account Name' and rename 'Language' to 'Nationality'

* For this we need to use the [rename](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) dataframe method from pandas
* When using the `columns` argument, the key of the dictionary is the old series name and the value is the one you want to change it to. Easy!


In [36]:
# Solution
df.rename(columns={'Channel':'Account Name', 'Language':'Nationality'}, inplace=True)

In [37]:
# Check the changes
df.head(10)

Unnamed: 0,Account Name,Peak viewers,Followers gained,Views gained,Nationality,T,Testing,followers/views ratio,followers gained bigger than views gained???,description
0,xQcOW,222720,1734810,93036735,English,449,99.9,0.018647,False,xQcOW is a(n) English twitch channel
1,summit1g,310998,1370184,89705964,English,376,99.9,0.015274,False,summit1g is a(n) English twitch channel
2,Gaules,387315,1023779,102611607,Portuguese,778,99.9,0.009977,False,Gaules is a(n) Portuguese twitch channel
3,ESL_CSGO,300575,703986,106546942,English,977,99.9,0.006607,False,ESL_CSGO is a(n) English twitch channel
4,Tfue,285644,2068424,78998587,English,748,99.9,0.026183,False,Tfue is a(n) English twitch channel
5,Asmongold,263720,554201,61715781,English,465,99.9,0.00898,False,Asmongold is a(n) English twitch channel
6,NICKMERCS,115633,1089824,46084211,English,970,99.9,0.023649,False,NICKMERCS is a(n) English twitch channel
7,Fextralife,68795,425468,670137548,English,621,99.9,0.000635,False,Fextralife is a(n) English twitch channel
8,loltyler1,89387,951730,51349926,English,762,99.9,0.018534,False,loltyler1 is a(n) English twitch channel
9,Anomaly,125408,1532689,36350662,English,424,99.9,0.042164,False,Anomaly is a(n) English twitch channel


* The change above stuck because we set a special argument called `inplace = True`. If we leave it out, `df` itself is not changed
* The idea is that pandas does not want to make any permanent changes without us being fully aware of it
* So without `inplace = True`, `df.rename()` returns a renamed copy that works as a preview of the change, while `df` keeps its old column names
* If we are sure that the change is the one we want to make, we set the `inplace` argument to `True` to commit to it, or assign the result back with `df = df.rename(...)`
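To see the difference `inplace` makes, here is a minimal sketch on a hand-built `sample` frame (a stand-in for the twitch data; the variable names `preview` and `result` are my own).

```python
import pandas as pd

# Small stand-in for the twitch data
sample = pd.DataFrame({
    'Channel': ['xQcOW', 'summit1g'],
    'Language': ['English', 'English'],
})

# Without inplace: we get a renamed copy back, the original is untouched
preview = sample.rename(columns={'Channel': 'Account Name'})
columns_after_preview = list(sample.columns)
print(columns_after_preview)
print(list(preview.columns))

# With inplace=True: the original is changed and the method returns None
result = sample.rename(columns={'Channel': 'Account Name'}, inplace=True)
print(result)
print(list(sample.columns))
```

Because `inplace=True` returns `None`, writing `df = df.rename(..., inplace=True)` would replace the dataframe with `None`. Use either `inplace=True` on its own or reassignment without it, not both.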