In [19]:
import pandas as pd # A general purpose Python library for data analysis
import numpy as np # A library for scientific computing in Python (e.g., provides high-performance multi-dimensional array objects and operations)

import matplotlib.pyplot as plt # a plotting library for Python and NumPy (readily customizable)
import seaborn as sns # Another plotting library for Python (less syntax and excellent default themes; it uses matplotlib behind the scenes)
import time

## Knowledge Stream Summer 2023

In this notebook, we will learn about the key data structures provided by the Pandas library: **Data Frames, Series, and Indices**.

In addition, we will learn about the following operations:
* How to access data contained in these structures?
* How to read files (e.g., csv, xlsx, sql) to create these structures?
* How to carry out different data manipulation tasks using these structures?

`Dataset`: US elections with information about candidates, their party, votes won, year of election and the result.

## Reading in Data Frames from Files

Pandas has a number of useful file reading tools. You can see them enumerated by typing **"pd.re"** and pressing `tab`. We'll be using **read_csv** today. Note that these file reading functions do all the *data parsing* for you, which is very useful.
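To make the "data parsing" point concrete, here is a minimal sketch that reads CSV text from memory instead of from disk. The `csv_text` string and the `sample` variable are hypothetical stand-ins for a real file; the two rows are copied from the elections data shown in this notebook.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """Year,Candidate,Popular vote,%
1824,Andrew Jackson,151271,57.210122
1824,John Quincy Adams,113142,42.789878
"""

# read_csv accepts a path or any file-like object, so StringIO avoids a real file.
sample = pd.read_csv(io.StringIO(csv_text))

# Column types are inferred during parsing: integers, text, and floats.
print(sample.dtypes)
```

The same call works with a file path, which is how the cells in this notebook use it.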

Before working with the data, let's first load the **elections.csv** file into a dataframe and take a look at it.

In [131]:
df = pd.read_csv(r"C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1\pandas\elections.csv")
df

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...,...
177,2016,Jill Stein,Green,1457226,loss,1.073699
178,2020,Joseph Biden,Democratic,81268924,win,51.311515
179,2020,Donald Trump,Republican,74216154,loss,46.858542
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


We can use the **head command** to show only a few rows of a dataframe.


In [21]:
df.head()

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789


There is also a **tail command**.

In [22]:
df.tail()

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
177,2016,Jill Stein,Green,1457226,loss,1.073699
178,2020,Joseph Biden,Democratic,81268924,win,51.311515
179,2020,Donald Trump,Republican,74216154,loss,46.858542
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979
181,2020,Howard Hawkins,Green,405035,loss,0.255731


The `read_csv` command lets us specify a **column to use as the index**. For example, we could have used __Year__ as the index.

In [9]:
df = pd.read_csv(r"C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1\pandas\elections.csv", index_col="Year")
df

Unnamed: 0_level_0,Candidate,Party,Popular vote,Result,%
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
1828,John Quincy Adams,National Republican,500897,loss,43.796073
1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...
2016,Jill Stein,Green,1457226,loss,1.073699
2020,Joseph Biden,Democratic,81268924,win,51.311515
2020,Donald Trump,Republican,74216154,loss,46.858542
2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


Alternatively, we could have used the **set_index** command on the dataframe.

In [23]:
df.set_index("Year",inplace=True)
df.head()

Unnamed: 0_level_0,Candidate,Party,Popular vote,Result,%
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
1828,John Quincy Adams,National Republican,500897,loss,43.796073
1832,Andrew Jackson,Democratic,702735,win,54.574789


# Caution:
By default, the **set_index command** (like most DataFrame methods) **does not modify the dataframe** it is called on: it returns a new dataframe and leaves the original untouched. Note: There is a flag called "inplace" which does modify the calling dataframe, and the cell above uses it (e.g., `df.set_index("Party", inplace=True)`).
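A minimal sketch of the difference between the default behaviour and `inplace=True`. The small `elections` table here is a hypothetical stand-in built in memory, not the real CSV file.

```python
import pandas as pd

# Hypothetical miniature table; the real notebook uses elections.csv.
elections = pd.DataFrame({
    "Year": [1824, 1824, 1828],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic"],
})

# Default behaviour: a new DataFrame is returned and the original keeps its index.
by_party = elections.set_index("Party")
print(elections.index)   # still the default integer index
print(by_party.index)    # labelled by Party

# With inplace=True the calling DataFrame itself is changed and None is returned.
result = elections.set_index("Year", inplace=True)
print(result)
print(elections.index.name)
```

Because `inplace=True` returns `None`, writing `df = df.set_index(..., inplace=True)` would discard the data, so pick one style or the other.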

## Duplicate Columns?
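By contrast with index labels, which may repeat (several rows above share the same Year), column names are expected to be unique. For example, if we try to read in a file for which column names are not unique, Pandas will automatically rename any duplicates (here the second `name` column becomes `name.1`). Load duplicate_columns.csv to see this.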

In [24]:
duplicate = pd.read_csv(r"C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1\pandas\duplicate_columns.csv")
duplicate

Unnamed: 0,name,name.1,flavor
0,john,smith,vanilla
1,zhang,shan,chocolate
2,fulan,alfulani,strawberry
3,hong,gildong,banana
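
The renaming can be reproduced without the file. This is a sketch under the assumption that the CSV header simply repeats `name`; the `csv_text` string is a hypothetical stand-in for duplicate_columns.csv.

```python
import io
import pandas as pd

# Hypothetical CSV text with a repeated "name" header.
csv_text = """name,name,flavor
john,smith,vanilla
zhang,shan,chocolate
"""

dupes = pd.read_csv(io.StringIO(csv_text))

# The second "name" column gets a numeric suffix so every column label is unique.
print(list(dupes.columns))
```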


## The [ ] Operator & Indexing

The DataFrame class has an indexing operator **[ ]** (also known as the 'bracket' operator) that lets you do a variety of different things. If you provide a string to the **[ ]** operator, you get back a ***Series*** corresponding to the requested label.

1. Use **[ ]** with a single label to retrieve one column.

2. Use **[ ]** with a list of labels to retrieve multiple columns.

In [29]:
single = df['Candidate']
print("single column \n ",single)
multi = df[['Candidate','Result']]
multi

single column 
  Year
1824       Andrew Jackson
1824    John Quincy Adams
1828       Andrew Jackson
1828    John Quincy Adams
1832       Andrew Jackson
              ...        
2016           Jill Stein
2020         Joseph Biden
2020         Donald Trump
2020         Jo Jorgensen
2020       Howard Hawkins
Name: Candidate, Length: 182, dtype: object


Unnamed: 0_level_0,Candidate,Result
Year,Unnamed: 1_level_1,Unnamed: 2_level_1
1824,Andrew Jackson,loss
1824,John Quincy Adams,win
1828,Andrew Jackson,win
1828,John Quincy Adams,loss
1832,Andrew Jackson,win
...,...,...
2016,Jill Stein,loss
2020,Joseph Biden,win
2020,Donald Trump,loss
2020,Jo Jorgensen,loss


The **[ ]** operator also accepts a list of strings. In this case, you get back a **DataFrame** corresponding to the requested strings.

In [30]:
multi1 = df[['Candidate','Result','%']]
multi1

Unnamed: 0_level_0,Candidate,Result,%
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1824,Andrew Jackson,loss,57.210122
1824,John Quincy Adams,win,42.789878
1828,Andrew Jackson,win,56.203927
1828,John Quincy Adams,loss,43.796073
1832,Andrew Jackson,win,54.574789
...,...,...,...
2016,Jill Stein,loss,1.073699
2020,Joseph Biden,win,51.311515
2020,Donald Trump,loss,46.858542
2020,Jo Jorgensen,loss,1.177979


A list of one label also returns a DataFrame. This can be handy if you want your results as a DataFrame, not a series.

Note that we can also use the **to_frame** method to turn a Series into a DataFrame.
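The three options described above can be compared side by side. This is a small sketch on a hypothetical two-row frame named `people`; only the return types matter here.

```python
import pandas as pd

# Hypothetical two-row table using values from the elections data.
people = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams"],
    "Result": ["loss", "win"],
})

as_series = people["Candidate"]      # single label -> Series
as_frame = people[["Candidate"]]     # list with one label -> DataFrame
converted = as_series.to_frame()     # Series -> one-column DataFrame

print(type(as_series).__name__, type(as_frame).__name__, type(converted).__name__)
```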

Extract the single column named "Candidate" from the DataFrame; the result will be a Series. Then convert that Series into a DataFrame.

In [38]:
candidates_series = df['Candidate']
candidates_series

0         Andrew Jackson
1      John Quincy Adams
2         Andrew Jackson
3      John Quincy Adams
4         Andrew Jackson
             ...        
177           Jill Stein
178         Joseph Biden
179         Donald Trump
180         Jo Jorgensen
181       Howard Hawkins
Name: Candidate, Length: 182, dtype: object

In [40]:
candidates_df = candidates_series.to_frame()
candidates_df

Unnamed: 0,Candidate
0,Andrew Jackson
1,John Quincy Adams
2,Andrew Jackson
3,John Quincy Adams
4,Andrew Jackson
...,...
177,Jill Stein
178,Joseph Biden
179,Donald Trump
180,Jo Jorgensen


### Row Indexing

The `[]` operator also accepts numerical slices as arguments. In this case, we are indexing by row, not column!

Extract a few rows from the DataFrame.

In [46]:
row_slice = df[0:1]
row_slice


Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122


In [48]:
specific_rows = df.iloc[[1, 3, 5]]
specific_rows

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
5,1832,Henry Clay,National Republican,484205,loss,37.603628


If you provide a single label (rather than a slice or a list) to the `[]` operator, it tries to use it as a column name. This is true even if the argument passed to **[ ]** is an integer.

In [49]:
# df[0] on its own raises a KeyError because there is no column labelled 0; we catch it here
try:
    value = df[0]  
except KeyError:
    print("No column named '0'")

No column named '0'


The following cells allow you to **test your understanding**. Let's go over the summary of what we have learnt (see slides).

# Creating DataFrames
Create a DataFrame from a list of rows and a list of column names.

In [54]:
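# Each inner list is one row: [Name, Age, Profession]
list1 = [
    ["Ali", 26, "Flutter Developer"],
    ["Talha", 25, "Data Scientist"],
    ["Faiq", 24, "Data Analyst"],
]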

columns = ["Name","Age","Profession"]
NewDataFrame = pd.DataFrame(list1, columns=columns)
NewDataFrame

Unnamed: 0,Name,Age,Profession
0,Ali,26,Flutter Developer
1,Talha,25,Data Scientist
2,Faiq,24,Data Analyst


Creating DataFrames using **Dictionary**.

In [57]:
data = {
  'Name':  ['Ali','Talha','Faiq'],
    'Age' : [26,25,24],
    'Profession':['Flutter Developer','Data Scientist','Data Analyst']
}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Profession
0,Ali,26,Flutter Developer
1,Talha,25,Data Scientist
2,Faiq,24,Data Analyst


## Filtering via Boolean Array Selection

The `[]` operator also supports an array of booleans as input. In this case, the array must be exactly as long as the number of rows. The result is a **filtered version of the data frame**, where **only rows corresponding to True appear**.

In [61]:
# elections[[False, False, False, False, False,
#           False, False, True, False, False,
#           True, False, False, False, True,
#           False, False, False, False, False,
#           False, True, False]]
bool_array = [True, False, True]
filtered= df[bool_array]
filtered

Unnamed: 0,Name,Age,Profession
0,Ali,26,Flutter Developer
2,Faiq,24,Data Analyst


One very common task in Data Science is **filtering**. Boolean Array Selection is one way to achieve this in Pandas. We start by observing that **logical operators** like the equality operator can be applied to **Pandas Series data** to generate a **Boolean Array**.

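Compare the 'Result' column to the string 'win' and show the results.

In [65]:
# Reload the elections data, since df was reassigned to the small Name/Age/Profession frame above
df = pd.read_csv(r"C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1\pandas\elections.csv")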
filtered1 = df[df['Result']=='win']
filtered1

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
8,1836,Martin Van Buren,Democratic,763291,win,52.272472
11,1840,William Henry Harrison,Whig,1275583,win,53.051213
13,1844,James Polk,Democratic,1339570,win,50.749477
16,1848,Zachary Taylor,Whig,1360235,win,47.309296
17,1852,Franklin Pierce,Democratic,1605943,win,51.013168
20,1856,James Buchanan,Democratic,1835140,win,45.30608
23,1860,Abraham Lincoln,Republican,1855993,win,39.699408


Compare the 'Party' column to the string 'Democratic' and show the results.

In [67]:
filtered2 = df[df['Party']== 'Democratic']
filtered2

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
8,1836,Martin Van Buren,Democratic,763291,win,52.272472
10,1840,Martin Van Buren,Democratic,1128854,loss,46.948787
13,1844,James Polk,Democratic,1339570,win,50.749477
14,1848,Lewis Cass,Democratic,1223460,loss,42.552229
17,1852,Franklin Pierce,Democratic,1605943,win,51.013168
20,1856,James Buchanan,Democratic,1835140,win,45.30608
28,1864,George B. McClellan,Democratic,1812807,loss,45.048488
29,1868,Horatio Seymour,Democratic,2708744,loss,47.334695


The output of the logical operator applied to the Series is **another Series with the same name and index, but of datatype boolean**.

These boolean Series can be used as an argument to the `[]` operator.
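A short sketch of what the comparison produces before it is passed to `[]`. The `toy` frame and the `is_win` variable are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical three-row table using values from the elections data.
toy = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Result": ["loss", "win", "win"],
})

# Comparing a Series to a scalar gives a boolean Series with the same name and index.
is_win = toy["Result"] == "win"
print(is_win.name, is_win.dtype)

# Passing the boolean Series to [] keeps only the rows where it is True.
winners = toy[is_win]
print(winners)
```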

Create a DataFrame of all election winners since 1980.

In [70]:
filtered3 = df[(df['Result']=='win') & (df['Year'] >= 1980 )]
filtered3

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
131,1980,Ronald Reagan,Republican,43903230,win,50.897944
133,1984,Ronald Reagan,Republican,54455472,win,59.023326
135,1988,George H. W. Bush,Republican,48886597,win,53.518845
140,1992,Bill Clinton,Democratic,44909806,win,43.118485
144,1996,Bill Clinton,Democratic,47400125,win,49.296938
152,2000,George W. Bush,Republican,50456002,win,47.974666
157,2004,George W. Bush,Republican,62040610,win,50.771824
162,2008,Barack Obama,Democratic,69498516,win,53.02351
168,2012,Barack Obama,Democratic,65915795,win,51.258484
173,2016,Donald Trump,Republican,62984828,win,46.407862


Above, we did not assign the result of the logical operators to a separate variable (for example, one called `iswin`) before filtering; doing so is uncommon. Usually, the boolean Series is created and used on the same line. Such code is a little tricky to read at first, but you'll get used to it quickly.

Show all 'win' results from 1980 to 2000 (inclusive).

In [71]:
filtered4 = df[(df['Result']=='win') & (df['Year'] >= 1980 ) & (df['Year'] <= 2000)]
filtered4

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
131,1980,Ronald Reagan,Republican,43903230,win,50.897944
133,1984,Ronald Reagan,Republican,54455472,win,59.023326
135,1988,George H. W. Bush,Republican,48886597,win,53.518845
140,1992,Bill Clinton,Democratic,44909806,win,43.118485
144,1996,Bill Clinton,Democratic,47400125,win,49.296938
152,2000,George W. Bush,Republican,50456002,win,47.974666


Show all 'loss' results. To restrict these to the Independent party, add `& (df['Party'] == 'Independent')` to the condition.

In [74]:
filtered5 = df[(df['Result']=='loss')]
filtered5

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
5,1832,Henry Clay,National Republican,484205,loss,37.603628
6,1832,William Wirt,Anti-Masonic,100715,loss,7.821583
7,1836,Hugh Lawson White,Whig,146109,loss,10.005985
...,...,...,...,...,...,...
176,2016,Hillary Clinton,Democratic,65853514,loss,48.521539
177,2016,Jill Stein,Green,1457226,loss,1.073699
179,2020,Donald Trump,Republican,74216154,loss,46.858542
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


We can select multiple criteria by creating multiple boolean Series and combining them using the `&` operator.
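As a sketch of combining conditions, the example below uses `&` (and), `|` (or), and `~` (not) on a hypothetical four-row frame. The variable names are illustrative, and the rows are taken from the elections data.

```python
import pandas as pd

# Hypothetical four-row table using values from the elections data.
toy = pd.DataFrame({
    "Year": [1824, 1824, 1980, 1992],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Ronald Reagan", "Bill Clinton"],
    "Result": ["loss", "win", "win", "win"],
    "%": [57.210122, 42.789878, 50.897944, 43.118485],
})

is_win = toy["Result"] == "win"
recent = toy["Year"] >= 1980

both = toy[is_win & recent]     # rows where both conditions hold
either = toy[is_win | recent]   # rows where at least one condition holds
not_win = toy[~is_win]          # rows where the condition is False

# Written inline, each comparison needs its own parentheses,
# because & binds more tightly than == or <.
minority_wins = toy[(toy["Result"] == "win") & (toy["%"] < 50)]
print(minority_wins)
```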

Show winning results with a vote percentage below 50%.

In [75]:
filtered6 = df[(df['Result']=='win') & (df['%'] < 50)]
filtered6

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
16,1848,Zachary Taylor,Whig,1360235,win,47.309296
20,1856,James Buchanan,Democratic,1835140,win,45.30608
23,1860,Abraham Lincoln,Republican,1855993,win,39.699408
33,1876,Rutherford Hayes,Republican,4034142,win,48.471624
36,1880,James Garfield,Republican,4453337,win,48.369234
39,1884,Grover Cleveland,Democratic,4914482,win,48.884933
43,1888,Benjamin Harrison,Republican,5443633,win,47.858041
47,1892,Grover Cleveland,Democratic,5553898,win,46.121393
70,1912,Woodrow Wilson,Democratic,6296284,win,41.933422


Once more, show all 'win' results from 1980 to 2000 (inclusive).

In [76]:
filtered7 = df[(df['Result']=='win') & (df['Year'] >= 1980 ) & (df['Year'] <= 2000)]
filtered7

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
131,1980,Ronald Reagan,Republican,43903230,win,50.897944
133,1984,Ronald Reagan,Republican,54455472,win,59.023326
135,1988,George H. W. Bush,Republican,48886597,win,53.518845
140,1992,Bill Clinton,Democratic,44909806,win,43.118485
144,1996,Bill Clinton,Democratic,47400125,win,49.296938
152,2000,George W. Bush,Republican,50456002,win,47.974666


## Loc and iLoc

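Show the first 5 entries. Both lines below select the same rows: `loc` includes the end label 4, while `iloc` excludes the end position 5.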

In [79]:
df.loc[:4]
df.iloc[:5]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789


You can provide `.loc` a slice of row labels (0 through 4) and a list of column labels ['Candidate', 'Party', 'Result'] as input to return a dataframe.

In [80]:
df.loc[:4,['Candidate', 'Party','Result']]

Unnamed: 0,Candidate,Party,Result
0,Andrew Jackson,Democratic-Republican,loss
1,John Quincy Adams,Democratic-Republican,win
2,Andrew Jackson,Democratic,win
3,John Quincy Adams,National Republican,loss
4,Andrew Jackson,Democratic,win


Loc also supports **slicing** (for all types, including numeric and string labels!). Note that the slicing for loc is **inclusive**, even for numeric slices.
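The inclusive behaviour is easiest to see next to positional slicing. Below is a small sketch on a hypothetical four-row frame with the default integer index, where labels and positions happen to coincide.

```python
import pandas as pd

# Hypothetical four-row table using values from the elections data.
toy = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson", "John Quincy Adams"],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic", "National Republican"],
    "Result": ["loss", "win", "win", "loss"],
})

# Label-based: both end points are included (rows 1, 2, 3; columns Candidate and Party).
by_label = toy.loc[1:3, "Candidate":"Party"]

# Position-based: the end point is excluded (rows 1, 2; columns 0 and 1).
by_position = toy.iloc[1:3, 0:2]

print(by_label.shape, by_position.shape)
```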

Use Slicing on Rows and Columns

In [84]:
sliced_df = df.loc[1:3, 'Candidate':'Result']
sliced_df

Unnamed: 0,Candidate,Party,Popular vote,Result
1,John Quincy Adams,Democratic-Republican,113142,win
2,Andrew Jackson,Democratic,642806,win
3,John Quincy Adams,National Republican,500897,loss


If we provide only a **single label** for the column argument, we get back a **Series**.

In [86]:
candidate_series = df.loc[:, 'Candidate']
candidate_series

0         Andrew Jackson
1      John Quincy Adams
2         Andrew Jackson
3      John Quincy Adams
4         Andrew Jackson
             ...        
177           Jill Stein
178         Joseph Biden
179         Donald Trump
180         Jo Jorgensen
181       Howard Hawkins
Name: Candidate, Length: 182, dtype: object

If we want a data frame instead and don't want to use to_frame, we can provide a **list** containing the column name.

In [88]:
can = df.loc[:, ['Candidate']]
can

Unnamed: 0,Candidate
0,Andrew Jackson
1,John Quincy Adams
2,Andrew Jackson
3,John Quincy Adams
4,Andrew Jackson
...,...
177,Jill Stein
178,Joseph Biden
179,Donald Trump
180,Jo Jorgensen


If we give only one row but many column labels, we'll get back a **Series** corresponding to a row of the table. This new Series has a neat index, where **each entry is the name of the column** that the data came from.

In [89]:
row_series = df.loc[2, ['Year', 'Candidate', 'Party']]
row_series

Year                   1828
Candidate    Andrew Jackson
Party            Democratic
Name: 2, dtype: object


If we omit the column argument altogether, the **default behavior is to retrieve all columns**.

In [99]:
all_columns_rows = df.loc[[1 ]]
all_columns_rows

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878


Specify rows as a list to retrieve specific entries; since no column argument is given, all columns are returned.

In [100]:
all_columns_selected_rows = df.loc[[1, 3]]
all_columns_selected_rows

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073


Boolean Series are also boolean arrays, so we can use the Boolean Array Selection from earlier using loc as well.

In [106]:
votes = df['Popular vote'] > 5000  # boolean Series: True where the popular vote exceeds 5000
selected_rows = df.loc[votes]      # keep only the rows where the mask is True
print(votes,'\n')
selected_rows

0      True
1      True
2      True
3      True
4      True
       ... 
177    True
178    True
179    True
180    True
181    True
Name: Popular vote, Length: 182, dtype: bool 



Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...,...
177,2016,Jill Stein,Green,1457226,loss,1.073699
178,2020,Joseph Biden,Democratic,81268924,win,51.311515
179,2020,Donald Trump,Republican,74216154,loss,46.858542
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


## String-labeled Rows

Let's do a quick example using data with string-labeled rows instead of integer labeled rows, just to make sure we're really understanding loc.

Use the mottos.csv file, with the State column as the index.

In [132]:
df = pd.read_csv(r"C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1\pandas\mottos.csv", index_col='State')
df

Unnamed: 0_level_0,Motto,Translation,Language,Date Adopted
State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Alabama,Audemus jura nostra defendere,We dare defend our rights!,Latin,1923
Alaska,North to the future,—,English,1967
Arizona,Ditat Deus,God enriches,Latin,1863
Arkansas,Regnat populus,The people rule,Latin,1907
California,Eureka (Εὕρηκα),I have found it,Greek,1849
Colorado,Nil sine numine,Nothing without providence.,Latin,"November 6, 1861"
Connecticut,Qui transtulit sustinet,He who transplanted sustains,Latin,"October 9, 1662"
Delaware,Liberty and Independence,—,English,1847
Florida,In God We Trust,—,English,1868
Georgia,"Wisdom, Justice, Moderation",—,English,1798


In [114]:
selected_rows = df.loc[['California', 'Texas']]
selected_rows

Unnamed: 0_level_0,Motto,Translation,Language,Date Adopted
State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
California,Eureka (Εὕρηκα),I have found it,Greek,1849
Texas,Friendship,—,English,1930


A range of rows can also be extracted using slice notation, even if the rows have string labels instead of integer labels.
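Since no cell demonstrates this here, the following is a minimal sketch of slicing by string label. The `mottos` frame is a hypothetical four-row stand-in for mottos.csv, built in memory with State as the index.

```python
import pandas as pd

# Hypothetical four-row stand-in for mottos.csv, indexed by State.
mottos = pd.DataFrame(
    {
        "Motto": ["Ditat Deus", "Regnat populus", "Eureka (Εὕρηκα)", "Nil sine numine"],
        "Language": ["Latin", "Latin", "Greek", "Latin"],
    },
    index=pd.Index(["Arizona", "Arkansas", "California", "Colorado"], name="State"),
)

# String-label slices follow the row order of the index and include both end points.
middle = mottos.loc["Arkansas":"California"]
print(middle)
```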

### iloc

loc's cousin iloc is very similar, but it accesses data by numerical position instead of by label. For example, to access the first 3 rows and first 3 columns of a table, we can use `.iloc[0:3, 0:3]`. `iloc` slicing is **exclusive** of the end position, just like standard Python slicing of numerical values.

Use iloc to extract the first 3 rows and first 3 columns from the elections DataFrame.

In [117]:
first_three_rows_cols = df.iloc[0:3, 0:3]
first_three_rows_cols

Unnamed: 0,Year,Candidate,Party
0,1824,Andrew Jackson,Democratic-Republican
1,1824,John Quincy Adams,Democratic-Republican
2,1828,Andrew Jackson,Democratic


We will use both `loc` and `iloc` in the course. `loc` is generally preferred for a number of reasons, for example:

1. It is harder to make mistakes since you have to literally write out what you want to get.
2. Code is easier to read, because the reader doesn't have to know e.g., what column #17 represents.
3. It is robust against permutations of the data, e.g., if the data provider switches the order of two columns.

However, iloc is sometimes more convenient. We'll provide examples of when iloc is the superior choice.
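The robustness argument in the list above can be sketched as follows. The scenario is hypothetical: `reordered` simulates a data provider swapping two columns of a small in-memory table.

```python
import pandas as pd

# Hypothetical two-row table using values from the elections data.
original = pd.DataFrame({
    "Year": [1824, 1828],
    "Candidate": ["Andrew Jackson", "Andrew Jackson"],
    "Party": ["Democratic-Republican", "Democratic"],
})

# Simulate the provider swapping the Candidate and Party columns.
reordered = original[["Year", "Party", "Candidate"]]

# Label-based access returns the same column before and after the swap.
by_label_before = original.loc[:, "Candidate"]
by_label_after = reordered.loc[:, "Candidate"]

# Position-based access silently returns a different column after the swap.
by_pos_before = original.iloc[:, 1]
by_pos_after = reordered.iloc[:, 1]

print(by_pos_before.name, by_pos_after.name)
```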

## Handy Properties and Utility Functions for Series and DataFrames

The `head` and `describe` methods, together with the `shape` and `size` attributes, can be used to quickly get a good sense of the data we're working with. For example:

In [118]:
df = pd.read_csv(r"C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1\pandas\mottos.csv")
df

Unnamed: 0,State,Motto,Translation,Language,Date Adopted
0,Alabama,Audemus jura nostra defendere,We dare defend our rights!,Latin,1923
1,Alaska,North to the future,—,English,1967
2,Arizona,Ditat Deus,God enriches,Latin,1863
3,Arkansas,Regnat populus,The people rule,Latin,1907
4,California,Eureka (Εὕρηκα),I have found it,Greek,1849
5,Colorado,Nil sine numine,Nothing without providence.,Latin,"November 6, 1861"
6,Connecticut,Qui transtulit sustinet,He who transplanted sustains,Latin,"October 9, 1662"
7,Delaware,Liberty and Independence,—,English,1847
8,Florida,In God We Trust,—,English,1868
9,Georgia,"Wisdom, Justice, Moderation",—,English,1798



Size of DataFrame

In [122]:
size = len(df)
print(size)


50


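Note that `len(df)` returns the number of rows (50). The `size` attribute instead counts every cell: 50 rows × 5 columns = 250. The fact that the size is 250 means our data file is relatively small, with only 250 total entries.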

Shape of DataFrame

In [124]:
shape = df.shape
shape

(50, 5)
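
Row count, shape, and size are easy to mix up, so here is a small sketch on a hypothetical 4 × 3 frame named `grid`.

```python
import pandas as pd

# Hypothetical frame with 4 rows and 3 columns.
grid = pd.DataFrame({"a": range(4), "b": range(4), "c": range(4)})

n_rows = len(grid)     # number of rows
dims = grid.shape      # (rows, columns)
n_cells = grid.size    # rows * columns

print(n_rows, dims, n_cells)
```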

Use the `describe` method to extract meaningful summary information from the DataFrame.

In [126]:
numerical = df.describe()
numerical


Unnamed: 0,State,Motto,Translation,Language,Date Adopted
count,50,50,49,50,50
unique,50,50,30,8,47
top,Alabama,Audemus jura nostra defendere,—,Latin,1893
freq,1,1,20,23,2


Above, we see a quick summary of all the data. For example, the most common language for mottos is Latin, which covers 23 different states. Does anything else seem surprising?
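The rows of the summary depend on the column types. As a sketch, the hypothetical `toy` frame below mixes a text column with a numeric column (the integer `Year` column is an assumption for illustration; in mottos.csv the dates are stored as text).

```python
import pandas as pd

# Hypothetical table: one text column and one numeric column.
toy = pd.DataFrame({
    "Language": ["Latin", "English", "Latin", "Greek"],
    "Year": [1923, 1967, 1863, 1849],
})

# By default, describe summarises only the numeric columns of a mixed frame.
numeric_summary = toy.describe()

# For text columns it reports count, unique, top (most common value), and freq.
text_summary = toy[["Language"]].describe()

print(numeric_summary)
print(text_summary)
```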

We can get a direct reference to the index using .index.

In [123]:
index = df.index
index

RangeIndex(start=0, stop=50, step=1)

In [128]:
df.head(2)

Unnamed: 0,State,Motto,Translation,Language,Date Adopted
0,Alabama,Audemus jura nostra defendere,We dare defend our rights!,Latin,1923
1,Alaska,North to the future,—,English,1967


It turns out the columns also have an Index. We can access this index by using `.columns`.

In [129]:
columns_of_dataframe = df.columns
columns_of_dataframe

Index(['State', 'Motto', 'Translation', 'Language', 'Date Adopted'], dtype='object')

## Sorting and Value Counts

There are also a ton of useful utility methods we can use with Data Frames and Series. For example, we can create a copy of a data frame sorted by a specific column using `sort_values`.

In [136]:
sorted_df = df.sort_values(by='Date Adopted', ascending=True)

As mentioned before, most DataFrame methods, including `sort_values`, return a copy and do **not** modify the original data structure unless you set `inplace=True`.
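A minimal sketch of both points: the original is left alone, and text columns such as 'Date Adopted' sort alphabetically rather than chronologically. The four-row `toy` frame is a hypothetical stand-in for the mottos data.

```python
import pandas as pd

# Hypothetical stand-in for the mottos data; dates are stored as text.
toy = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "Connecticut"],
    "Date Adopted": ["1923", "1967", "1863", "October 9, 1662"],
})

# sort_values returns a new, sorted DataFrame.
sorted_toy = toy.sort_values(by="Date Adopted")

# Strings are compared character by character, so "October 9, 1662"
# lands after every value that starts with a digit.
print(sorted_toy)

# The original row order is unchanged.
print(toy)
```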

If we want to sort in reverse order, we can set `ascending=False`.

In [138]:
df.sort_values('Date Adopted', ascending=False)

Unnamed: 0_level_0,Motto,Translation,Language,Date Adopted
State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Washington,Al-ki or Alki,By and by,Chinook Jargon,—
West Virginia,Montani semper liberi,Mountaineers are always free,Latin,"September 26, 1863"
Connecticut,Qui transtulit sustinet,He who transplanted sustains,Latin,"October 9, 1662"
Ohio,"With God, all things are possible",—,English,"October 1, 1959"
Colorado,Nil sine numine,Nothing without providence.,Latin,"November 6, 1861"
Rhode Island,Hope,—,English,"May 4, 1664"
Utah,Industry,,English,"May 3, 1896"
Tennessee,Agriculture and Commerce,—,English,"May 24, 1802"
South Carolina,Dum spiro spero \nAnimis opibusque parati,"While I breathe, I hope \nReady in soul and re...",Latin,"May 22, 1777"
New Jersey,Liberty and prosperity,—,English,"March 26, 1928"


We can also use `sort_values` on Series objects.

In [139]:
df['Language'].sort_values().head(50)

State
Washington        Chinook Jargon
Wyoming                  English
New Jersey               English
New Hampshire            English
Nevada                   English
Nebraska                 English
Wisconsin                English
Pennsylvania             English
Rhode Island             English
South Dakota             English
Louisiana                English
Ohio                     English
Texas                    English
Iowa                     English
Tennessee                English
Illinois                 English
Alaska                   English
Indiana                  English
Florida                  English
Delaware                 English
Georgia                  English
Utah                     English
Minnesota                 French
California                 Greek
Hawaii                  Hawaiian
Maryland                 Italian
South Carolina             Latin
Vermont                    Latin
Oregon                     Latin
Virginia                   Latin
West

For Series, the `value_counts` method is often quite handy.

In [140]:
df['Language'].value_counts()

Latin             23
English           21
Greek              1
Hawaiian           1
Italian            1
French             1
Spanish            1
Chinook Jargon     1
Name: Language, dtype: int64

Also commonly used is the `unique` method, which returns **all unique values** as a numpy array.

In [141]:
df['Language'].unique()

array(['Latin', 'English', 'Greek', 'Hawaiian', 'Italian', 'French',
       'Spanish', 'Chinook Jargon'], dtype=object)
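
For a self-contained comparison of the two methods, here is a sketch on a hypothetical six-element Series named `languages`.

```python
import pandas as pd

# Hypothetical Series of motto languages.
languages = pd.Series(
    ["Latin", "English", "Latin", "Greek", "English", "Latin"],
    name="Language",
)

# value_counts: how often each value occurs, most frequent first.
counts = languages.value_counts()

# unique: the distinct values, in order of first appearance.
distinct = languages.unique()

print(counts)
print(list(distinct))
```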

In [142]:
def fiba(n):
    if n < 2:
        return n
    else:
        return fiba(n-1) + fiba(n-2)



fiba(5)

5

# Thank you!