In [130]:
import pandas as pd  # General-purpose Python library for data analysis
import numpy as np  # Scientific computing library (high-performance multi-dimensional arrays and operations on them)

import matplotlib.pyplot as plt  # Plotting library for Python and NumPy (readily customizable)
import seaborn as sns  # Plotting library built on top of matplotlib (less syntax, excellent default themes)
import time

## Knowledge Streams 2024

In this notebook, we will learn about the key data structures provided by the Pandas library: **Data Frames, Series, and Indices**.

In addition, we will learn about the following operations:
* How to access data contained in these structures?
* How to read files (e.g., csv, xlsx, sql) to create these structures?
* How to carry out different data manipulation tasks using these structures?

`Dataset`: US elections with information about candidates, their party, votes won, year of election and the result.

## Reading in Data Frames from Files
We'll be using **read_csv** today. This function does all the *data parsing* for you: it splits each line into fields, reads the header row as column names, and infers a type for every column.

Before loading the file into a DataFrame, let's first take a look at the **elections.csv** file.

In [131]:
# Load the CSV file into a DataFrame
df = pd.read_csv('./data/elections.csv')

# Report how many observations (rows) and features (columns) the data set has
print(f'.\nThe data set has {df.shape[0]} observations and {df.shape[1]} features')

.
The data set has 182 observations and 6 features
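
To make the parsing step concrete, here is a minimal sketch that reads a few rows from an in-memory string instead of a file. The three rows are copied from the output shown in this notebook; the names `csv_text` and `sample` are only illustrative.

```python
import io

import pandas as pd

# Three rows in the same layout as elections.csv, held in memory
# so the example does not depend on a file on disk.
csv_text = """Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
"""

# read_csv accepts any file-like object, not only a path.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # one inferred type per column
```

Numeric columns such as `Year` and `%` come back as integer and floating-point types without any extra conversion code.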


In [132]:
# head() shows only the first few rows of a DataFrame (5 by default).
print(df.head())
# tail() shows the last few rows.
print(df.tail())

   Year          Candidate                  Party  Popular vote Result  \
0  1824     Andrew Jackson  Democratic-Republican        151271   loss   
1  1824  John Quincy Adams  Democratic-Republican        113142    win   
2  1828     Andrew Jackson             Democratic        642806    win   
3  1828  John Quincy Adams    National Republican        500897   loss   
4  1832     Andrew Jackson             Democratic        702735    win   

           %  
0  57.210122  
1  42.789878  
2  56.203927  
3  43.796073  
4  54.574789  
     Year       Candidate        Party  Popular vote Result          %
177  2016      Jill Stein        Green       1457226   loss   1.073699
178  2020    Joseph Biden   Democratic      81268924    win  51.311515
179  2020    Donald Trump   Republican      74216154   loss  46.858542
180  2020    Jo Jorgensen  Libertarian       1865724   loss   1.177979
181  2020  Howard Hawkins        Green        405035   loss   0.255731
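
Both methods also take an optional row count. A small sketch on a made-up ten-row frame (the name `numbers` is hypothetical):

```python
import pandas as pd

# A toy DataFrame with the values 0..9 in a single column.
numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)  # rows 0, 1, 2
last_two = numbers.tail(2)     # rows 8, 9

print(first_three)
print(last_two)
```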


In [133]:
# read_csv lets us specify a column to use as the index. For example, we can use Year as the index.
df=pd.read_csv('./data/elections.csv',index_col='Year')
df.head()

Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
1828,John Quincy Adams,National Republican,500897,loss,43.796073
1832,Andrew Jackson,Democratic,702735,win,54.574789
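
Notice that the same year appears on several rows, so an index does not have to be unique. The sketch below, which again uses a few rows copied from the output above and an in-memory string, shows what that means for label-based lookup with `.loc`:

```python
import io

import pandas as pd

csv_text = """Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
"""

# Year becomes the row labels and is no longer an ordinary column.
by_year = pd.read_csv(io.StringIO(csv_text), index_col="Year")

print(by_year.index.is_unique)  # the label 1824 occurs twice

# Looking up a repeated label returns every matching row.
rows_1824 = by_year.loc[1824]
print(rows_1824)
```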


In [134]:
# Alternatively, we can call set_index on the DataFrame to make a particular column the index.
# Note: the previous index (Year) is replaced, so it no longer appears in the result.
df.set_index('Party',inplace=True)
df.head()

Party,Candidate,Popular vote,Result,%
Democratic-Republican,Andrew Jackson,151271,loss,57.210122
Democratic-Republican,John Quincy Adams,113142,win,42.789878
Democratic,Andrew Jackson,642806,win,56.203927
National Republican,John Quincy Adams,500897,loss,43.796073
Democratic,Andrew Jackson,702735,win,54.574789


# Caution:
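By default, the **set_index command** (like most other DataFrame methods) **does not modify the DataFrame it is called on**: it returns a new DataFrame and leaves the original untouched, so the result must be assigned to a variable to be kept. Note: There is a flag called "inplace" which does modify the calling DataFrame. The cell above uses it (`df.set_index('Party',inplace=True)`), which is why `df` itself changed.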
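
The difference can be checked directly. This is a minimal sketch on a two-row frame; the variable names `elections` and `by_party` are illustrative only.

```python
import pandas as pd

elections = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams"],
    "Party": ["Democratic", "National Republican"],
})

# Without inplace=True, set_index returns a new DataFrame.
by_party = elections.set_index("Party")

print(elections.index.name)  # original still has its default, unnamed index
print(by_party.index.name)   # the returned copy is indexed by Party
```

Assigning the result to a new name, as here, keeps the original available for later cells.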

## Duplicate Columns?
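Index labels do not have to be unique (the **Year** index above repeats for every candidate in the same election). By contrast, column names MUST be unique. If we read in a file whose column names are not unique, Pandas automatically renames the duplicates by appending a numeric suffix (e.g., `name.1`). Load duplicate_columns.csv to see this.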

In [135]:
# Load the file and display the resulting column names
duplicate=pd.read_csv('./data/duplicate_columns.csv')
duplicate

Unnamed: 0,name,name.1,flavor
0,john,smith,vanilla
1,zhang,shan,chocolate
2,fulan,alfulani,strawberry
3,hong,gildong,banana
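
The same renaming can be reproduced without the file. The sketch below assumes a header in which `name` appears twice, which is what the output above suggests:

```python
import io

import pandas as pd

# Assumed header: the label "name" is used for two different columns.
csv_text = """name,name,flavor
john,smith,vanilla
zhang,shan,chocolate
"""

dup = pd.read_csv(io.StringIO(csv_text))

# The second "name" column is renamed so every column label is unique.
print(list(dup.columns))  # ['name', 'name.1', 'flavor']
```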


## The [ ] Operator & Indexing

The DataFrame class has an indexing operator **[ ]** (also known as the 'bracket' operator) that lets you do a variety of different things. If you provide a string to the **[ ]** operator, you get back a ***Series*** corresponding to the requested label.

1. Use **[ ]** with a string to display a single column.

2. Use a list of strings to retrieve multiple columns.
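
As a quick sketch of both cases, on a two-row frame with illustrative variable names:

```python
import pandas as pd

small = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams"],
    "Popular vote": [642806, 500897],
})

# A single string selects one column and returns a Series.
votes = small["Popular vote"]
print(type(votes).__name__)  # Series

# A list of strings selects several columns and returns a DataFrame.
both = small[["Candidate", "Popular vote"]]
print(type(both).__name__)   # DataFrame
```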

In [137]:
# Retrieve two columns from the elections DataFrame and convert each one to a plain Python list.
candidate_list = df['Candidate'].tolist()
popular_vote_list = df['Popular vote'].tolist()

# Display the lists
print("Candidate List:", candidate_list)
print("\nPopular Vote List:", popular_vote_list)

Candidate List: ['Andrew Jackson', 'John Quincy Adams', 'Andrew Jackson', 'John Quincy Adams', 'Andrew Jackson', 'Henry Clay', 'William Wirt', 'Hugh Lawson White', 'Martin Van Buren', 'William Henry Harrison', 'Martin Van Buren', 'William Henry Harrison', 'Henry Clay', 'James Polk', 'Lewis Cass', 'Martin Van Buren', 'Zachary Taylor', 'Franklin Pierce', 'John P. Hale', 'Winfield Scott', 'James Buchanan', 'John C. Frémont', 'Millard Fillmore', 'Abraham Lincoln', 'John Bell', 'John C. Breckinridge', 'Stephen A. Douglas', 'Abraham Lincoln', 'George B. McClellan', 'Horatio Seymour', 'Ulysses Grant', 'Horace Greeley', 'Ulysses Grant', 'Rutherford Hayes', 'Samuel J. Tilden', 'James B. Weaver', 'James Garfield', 'Winfield Scott Hancock', 'Benjamin Butler', 'Grover Cleveland', 'James G. Blaine', 'John St. John', 'Alson Streeter', 'Benjamin Harrison', 'Clinton B. Fisk', 'Grover Cleveland', 'Benjamin Harrison', 'Grover Cleveland', 'James B. Weaver', 'John Bidwell', 'John M. Palmer', 'Joshua Lever

In [93]:
# The [ ] operator also accepts a list of strings. In this case, you get back a DataFrame containing the requested columns.
# The result is named `subset` rather than `list` to avoid shadowing Python's built-in list type.
subset = df[['Candidate', 'Party', 'Popular vote']]
subset.head()

Unnamed: 0,Candidate,Party,Popular vote
0,Andrew Jackson,Democratic-Republican,151271
1,John Quincy Adams,Democratic-Republican,113142
2,Andrew Jackson,Democratic,642806
3,John Quincy Adams,National Republican,500897
4,Andrew Jackson,Democratic,702735


A list containing a single label also returns a DataFrame. This can be handy if you want your result as a DataFrame, not a Series.

Note that we can also use the **to_frame** method to turn a Series into a DataFrame.
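
The two routes give the same one-column DataFrame, as this small sketch (with made-up variable names) checks:

```python
import pandas as pd

small = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams"],
    "Party": ["Democratic", "National Republican"],
})

one_col = small[["Candidate"]]                  # list with one label
via_to_frame = small["Candidate"].to_frame()    # Series converted to a DataFrame

# Both are DataFrames with a single "Candidate" column and the same rows.
print(one_col.equals(via_to_frame))
```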

Extract the single column "Candidate" from the DataFrame; the result is a Series. Then convert that Series into a DataFrame.

In [97]:
# Code here
candidate_series = df['Candidate']
candidate_series.head()

0       Andrew Jackson
1    John Quincy Adams
2       Andrew Jackson
3    John Quincy Adams
4       Andrew Jackson
Name: Candidate, dtype: object

In [98]:
# Code here
candidate_df = candidate_series.to_frame()
candidate_df.head()

Unnamed: 0,Candidate
0,Andrew Jackson
1,John Quincy Adams
2,Andrew Jackson
3,John Quincy Adams
4,Andrew Jackson


The following cells allow you to **test your understanding**. Let's go over the summary of what we have learnt (see slides).

# Creating DataFrames
Create a DataFrame from lists, using the column names given in the slides.

In [111]:
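# Code here
candidates_list = ['Ali', 'Ahmed', 'Kashif']
votes_list = [33, 55, 56]
# Caution: the second positional argument of pd.DataFrame is `index`, not `columns`,
# so votes_list becomes the row labels and the single data column is labelled 0.
list_df = pd.DataFrame(candidates_list, votes_list)
list_df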

Unnamed: 0,0
33,Ali
55,Ahmed
56,Kashif
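
If the goal is two named columns rather than votes as row labels, one possible approach is to pair the lists up and pass `columns=` explicitly. The column names `Candidate` and `Votes` below are assumptions, since the slides are not reproduced here.

```python
import pandas as pd

candidates_list = ["Ali", "Ahmed", "Kashif"]
votes_list = [33, 55, 56]

# zip pairs each candidate with their votes, giving one tuple per row.
rows = list(zip(candidates_list, votes_list))

# Column names are assumed for illustration.
table = pd.DataFrame(rows, columns=["Candidate", "Votes"])
print(table)
```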


Create a DataFrame from the **dictionary** given in the slides. Each key becomes a column name and each list becomes that column's values.

In [105]:
# Code here
dictionary_df=pd.DataFrame({"Fruit":["Strawberry", "Orange"],
"Price": [5.49, 3.99]})
dictionary_df

Unnamed: 0,Fruit,Price
0,Strawberry,5.49
1,Orange,3.99


Create a DataFrame from the **Series** given in the slides.

In [139]:
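# Build a Series with a custom index, then wrap it in a DataFrame.
s = pd.Series([-1, 10, 2], index=["a", "b", "c"])
series_df = pd.DataFrame(s)  # the unnamed Series becomes a single column labelled 0
series_df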

Unnamed: 0,0
a,-1
b,10
c,2
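
The column label `0` appears because the Series has no name. A short sketch of two ways to get a meaningful label, where the names `value` and `doubled` are hypothetical:

```python
import pandas as pd

s = pd.Series([-1, 10, 2], index=["a", "b", "c"])

# to_frame accepts a name for the resulting column.
named = s.to_frame(name="value")

# A dictionary of Series builds one column per key, aligned on the shared index.
combined = pd.DataFrame({"value": s, "doubled": s * 2})

print(named)
print(combined)
```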
