In [16]:
import pandas as pd # A general purpose Python library for data analysis
import numpy as np # A library for scientific computing in Python (e.g., provides high-performance multi-dimensional array objects and operations)

import matplotlib.pyplot as plt # a plotting library for Python and NumPy (readily customizable)
import seaborn as sns # Another plotting library for Python (less boilerplate, excellent default themes; it uses matplotlib behind the scenes)
import time

In [17]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Knowledge Streams 2024

In this notebook, we will learn about the key data structures provided by the Pandas library: **Data Frames, Series, and Indices**.

In addition, we will learn about the following operations:
* How to access data contained in these structures?
* How to read files (e.g., csv, xlsx, sql) to create these structures?
* How to carry out different data manipulation tasks using these structures?

`Dataset`: US elections with information about candidates, their party, votes won, year of election and the result.

## Reading in Data Frames from Files
We'll be using **read_csv** today. Note that this file reading function does all the *data parsing* for you, which is very useful.
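To make the "data parsing" point concrete, here is a minimal sketch that reads CSV text from memory instead of from Drive. The `csv_text` string and the `sample` name are illustrative assumptions; the two rows are copied from the elections data shown later in this notebook.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = (
    "Year,Candidate,Popular vote,%\n"
    "1824,Andrew Jackson,151271,57.210122\n"
    "1824,John Quincy Adams,113142,42.789878\n"
)

# read_csv splits the lines into columns and infers a type for each column.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)   # (2, 4)
print(sample.dtypes)  # integer, text, and float columns are inferred automatically
```

Passing a path such as the Drive path used below works the same way; only the source of the text changes.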

Before loading a file into a dataframe, let's first take a look at the **elections.csv** file

In [18]:
#Load csv file and print shape
# Code here
data=pd.read_csv("/content/drive/MyDrive/Pandas Work/elections.csv")
shape=data.shape
print(shape)
# how many observation and features are given
#Code here
print("Here are ",shape[0],"observations and ",shape[1],"features")

(182, 6)
Here are  182 observations and  6 features


In [19]:
# We can use the **head command** to show only a few rows of a dataframe from start.
# Code here
print(data.head())
#Use **tail command** to show last few observation.
print(data.tail())
# code here

   Year          Candidate                  Party  Popular vote Result  \
0  1824     Andrew Jackson  Democratic-Republican        151271   loss   
1  1824  John Quincy Adams  Democratic-Republican        113142    win   
2  1828     Andrew Jackson             Democratic        642806    win   
3  1828  John Quincy Adams    National Republican        500897   loss   
4  1832     Andrew Jackson             Democratic        702735    win   

           %  
0  57.210122  
1  42.789878  
2  56.203927  
3  43.796073  
4  54.574789  
     Year       Candidate        Party  Popular vote Result          %
177  2016      Jill Stein        Green       1457226   loss   1.073699
178  2020    Joseph Biden   Democratic      81268924    win  51.311515
179  2020    Donald Trump   Republican      74216154   loss  46.858542
180  2020    Jo Jorgensen  Libertarian       1865724   loss   1.177979
181  2020  Howard Hawkins        Green        405035   loss   0.255731


In [20]:
#The `read_csv` command lets us specify a **column to use as an index**. For example, we could have used __Year__ as the index.
#Code here
data1=pd.read_csv("/content/drive/MyDrive/Pandas Work/elections.csv",index_col="Year")
data1.head()

Unnamed: 0_level_0,Candidate,Party,Popular vote,Result,%
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
1828,John Quincy Adams,National Republican,500897,loss,43.796073
1832,Andrew Jackson,Democratic,702735,win,54.574789


In [21]:
#Alternatively, we could have used the **set_index** command on the dataframe to set a particular column as the index.
# code here
data.set_index("Year")

Unnamed: 0_level_0,Candidate,Party,Popular vote,Result,%
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
1828,John Quincy Adams,National Republican,500897,loss,43.796073
1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...
2016,Jill Stein,Green,1457226,loss,1.073699
2020,Joseph Biden,Democratic,81268924,win,51.311515
2020,Donald Trump,Republican,74216154,loss,46.858542
2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


# Caution:
The **set_index command** (like most DataFrame methods) **does not modify the dataframe** it is called on; it returns a new one, so the original `data` is untouched. Note: There is a flag called `inplace` which does modify the calling dataframe (e.g., `data.set_index("Party", inplace=True)`).
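A small sketch of this behavior, using a hypothetical two-row frame called `toy` rather than the full elections data:

```python
import pandas as pd

# Hypothetical miniature of the elections table.
toy = pd.DataFrame({
    "Year": [1824, 1824],
    "Candidate": ["Andrew Jackson", "John Quincy Adams"],
})

indexed = toy.set_index("Year")  # returns a new DataFrame

print(toy.columns.tolist())      # ['Year', 'Candidate'] -> the original is unchanged
print(indexed.columns.tolist())  # ['Candidate'] -> 'Year' moved into the index
print(indexed.index.name)        # Year
```

To keep the change, assign the result to a name (as done with `indexed` here) rather than calling the method and discarding what it returns.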

## Duplicate Columns?
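Index labels do not have to be unique (the `Year` index above repeats values). By contrast, `read_csv` makes column names unique: if we read in a file whose column names are not unique, Pandas will automatically rename any duplicates (e.g., a second `name` column becomes `name.1`). Load duplicate_columns.csv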

In [22]:
#Answer Here
duplicate_columns=pd.read_csv("/content/drive/MyDrive/DataScience_with_KS/duplicate_columns.csv")
duplicate_columns.head()

Unnamed: 0,name,name.1,flavor
0,john,smith,vanilla
1,zhang,shan,chocolate
2,fulan,alfulani,strawberry
3,hong,gildong,banana
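The same renaming can be reproduced without the file. This is a minimal sketch; the `csv_text` string is a hypothetical stand-in for duplicate_columns.csv with two of its rows.

```python
import io
import pandas as pd

# Hypothetical CSV text with the column name "name" appearing twice.
csv_text = "name,name,flavor\njohn,smith,vanilla\nzhang,shan,chocolate\n"

dupes = pd.read_csv(io.StringIO(csv_text))
print(dupes.columns.tolist())  # ['name', 'name.1', 'flavor']
```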


## The [ ] Operator & Indexing

The DataFrame class has an indexing operator **[ ]** (also known as the 'bracket' operator) that lets you do a variety of different things. If you provide a string to the **[ ]** operator, you get back a ***Series*** corresponding to the requested label.

1. Use **[ ]** to display different columns

2. Use a list to retrieve multiple columns

In [23]:
# Answer Here
res=data[["Year"]]
print(res)
col=["Candidate","Party"]
data[col]


     Year
0    1824
1    1824
2    1828
3    1828
4    1832
..    ...
177  2016
178  2020
179  2020
180  2020
181  2020

[182 rows x 1 columns]


Unnamed: 0,Candidate,Party
0,Andrew Jackson,Democratic-Republican
1,John Quincy Adams,Democratic-Republican
2,Andrew Jackson,Democratic
3,John Quincy Adams,National Republican
4,Andrew Jackson,Democratic
...,...,...
177,Jill Stein,Green
178,Joseph Biden,Democratic
179,Donald Trump,Republican
180,Jo Jorgensen,Libertarian


The **[ ]** operator also accepts a list of strings. In this case, you get back a **DataFrame** corresponding to the requested strings.

In [24]:
# Answer Here
col=data[["Candidate","Party"]]
col.head()


Unnamed: 0,Candidate,Party
0,Andrew Jackson,Democratic-Republican
1,John Quincy Adams,Democratic-Republican
2,Andrew Jackson,Democratic
3,John Quincy Adams,National Republican
4,Andrew Jackson,Democratic


A list of one label also returns a DataFrame. This can be handy if you want your results as a DataFrame, not a series.

Note that we can also use the **to_frame** method to turn a Series into a DataFrame.
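The three cases above (a string, a one-element list, and `to_frame`) can be compared side by side. This is a sketch on a hypothetical two-row frame named `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams"],
    "Party": ["Democratic-Republican", "Democratic-Republican"],
})

as_series = toy["Candidate"]       # string label -> Series
as_frame = toy[["Candidate"]]      # list with one label -> DataFrame
converted = as_series.to_frame()   # Series -> DataFrame

print(type(as_series).__name__)    # Series
print(type(as_frame).__name__)     # DataFrame
print(as_frame.equals(converted))  # True: both routes give the same one-column frame
```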

Extract a single column (e.g., "Year") from the DataFrame; the result is a Series. Then convert that Series into a DataFrame.

In [25]:
# Answer Here
ser=data.Year
ser_df=ser.to_frame()
ser_df

Unnamed: 0,Year
0,1824
1,1824
2,1828
3,1828
4,1832
...,...
177,2016
178,2020
179,2020
180,2020


In [26]:
# Answer Here
print(ser.index)
print(ser.values)

RangeIndex(start=0, stop=182, step=1)
[1824 1824 1828 1828 1832 1832 1832 1836 1836 1836 1840 1840 1844 1844
 1848 1848 1848 1852 1852 1852 1856 1856 1856 1860 1860 1860 1860 1864
 1864 1868 1868 1872 1872 1876 1876 1880 1880 1880 1884 1884 1884 1884
 1888 1888 1888 1888 1892 1892 1892 1892 1896 1896 1896 1896 1900 1900
 1900 1904 1904 1904 1904 1904 1908 1908 1908 1908 1912 1912 1912 1912
 1912 1916 1916 1916 1916 1920 1920 1920 1920 1920 1924 1924 1924 1928
 1928 1928 1932 1932 1932 1932 1936 1936 1936 1936 1940 1940 1940 1944
 1944 1948 1948 1948 1948 1948 1948 1952 1952 1952 1956 1956 1956 1960
 1960 1964 1964 1968 1968 1968 1972 1972 1972 1976 1976 1976 1976 1976
 1976 1980 1980 1980 1980 1980 1984 1984 1984 1988 1988 1988 1988 1992
 1992 1992 1992 1992 1996 1996 1996 1996 1996 1996 1996 2000 2000 2000
 2000 2000 2004 2004 2004 2004 2004 2004 2008 2008 2008 2008 2008 2008
 2012 2012 2012 2012 2016 2016 2016 2016 2016 2016 2020 2020 2020 2020]


### Row Indexing

The `[]` operator also accepts numerical slices as arguments. In this case, we are indexing by row, not column!

Extract a few rows from the DataFrame

In [27]:
# Answer Here
data[0::2]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
6,1832,William Wirt,Anti-Masonic,100715,loss,7.821583
8,1836,Martin Van Buren,Democratic,763291,win,52.272472
...,...,...,...,...,...,...
172,2016,Darrell Castle,Constitution,203091,loss,0.149640
174,2016,Evan McMullin,Independent,732273,loss,0.539546
176,2016,Hillary Clinton,Democratic,65853514,loss,48.521539
178,2020,Joseph Biden,Democratic,81268924,win,51.311515


If you provide a single label (rather than a slice) to the `[]` operator, it tries to use it as a column name. This is true even if the argument passed to **[ ]** is an integer.
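A short sketch of the difference between an integer and a slice, on a hypothetical frame named `toy` whose columns are all strings:

```python
import pandas as pd

toy = pd.DataFrame({"Year": [1824, 1828], "Result": ["loss", "win"]})

raised = False
try:
    toy[0]               # looked up as a column label; no column is named 0
except KeyError as err:
    raised = True
    print("KeyError:", err)

first_row = toy[0:1]     # a slice is interpreted as row positions instead
print(first_row)
```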

In [28]:
# data[0]  # this does not work (it raises a KeyError); uncomment it to see it fail
data[0:5]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789


In [29]:
s=pd.Series(["One","Two","Three"],[1,2,3])
s1=pd.Series(["One","Two","Three"],index=[1,2,3])
s1.index=["One_","Two_","Three"]
s[s=="One"]

Unnamed: 0,0
1,One


The following cells allow you to **test your understanding**. Let's go over the summary of what we have learnt (see slides).

# Creating DataFrames
Create a DataFrame using a list and column names.

In [30]:
# Answer Here
df1=pd.DataFrame([1,2,3],columns=["Numbers"])
df1.head()
df1=pd.DataFrame([[1,"Ingrediants","Milk"],[2,"Responses","Yumy"]],columns=["Numbers","Values","Analysis"])
df1.head()


Unnamed: 0,Numbers,Values,Analysis
0,1,Ingrediants,Milk
1,2,Responses,Yumy


Creating DataFrames using **Dictionary**.

In [31]:
# Answer Here
df1=pd.DataFrame({"Numbers":[1,2,3]})
df1
df2=pd.DataFrame({"Numbers":[1,2],"Values":["Ingrediants","Responses"],"Analysis":["Milk","Yumy"]},index=[1,2])
df2

Unnamed: 0,Numbers,Values,Analysis
1,1,Ingrediants,Milk
2,2,Responses,Yumy


## Filtering via Boolean Array Selection

The `[]` operator also supports an array of booleans as input. In this case, the array must be exactly as long as the number of rows. The result is a **filtered version of the data frame**, where **only rows corresponding to True appear**.

In [32]:
df2[[True,False]]

Unnamed: 0,Numbers,Values,Analysis
1,1,Ingrediants,Milk


In [68]:
s=pd.Series([1,2,3,4,5,6,6,7,8,9],index=[1,2,3,4,5,6,7,8,9,10])
df1=pd.DataFrame({"Integer":s})
df1[[True, False, True, False,
True, False, True, False, True, False]]


Unnamed: 0,Integer
1,1
3,3
5,5
7,6
9,8


One very common task in Data Science is **filtering**. Boolean Array Selection is one way to achieve this in Pandas. We start by observing that **comparison operators** like the equality operator can be applied to **Pandas Series data** to generate a **Boolean Array**.

Compare the 'Result' column to the string 'win' and show the results

In [34]:
#Answer Here

ser_bol=data.Result=="win"
data[ser_bol].head()

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
8,1836,Martin Van Buren,Democratic,763291,win,52.272472
11,1840,William Henry Harrison,Whig,1275583,win,53.051213


Compare the 'Party' column to the string 'Democratic' and show the results

In [72]:
#Answer
res=data.Party=="Democratic"
data[res].head()
res=data.loc[data["Party"]=="Democratic"]
res

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
8,1836,Martin Van Buren,Democratic,763291,win,52.272472
10,1840,Martin Van Buren,Democratic,1128854,loss,46.948787
13,1844,James Polk,Democratic,1339570,win,50.749477
14,1848,Lewis Cass,Democratic,1223460,loss,42.552229
17,1852,Franklin Pierce,Democratic,1605943,win,51.013168
20,1856,James Buchanan,Democratic,1835140,win,45.30608
28,1864,George B. McClellan,Democratic,1812807,loss,45.048488
29,1868,Horatio Seymour,Democratic,2708744,loss,47.334695


The output of the comparison operator applied to the Series is **another Series with the same name and index, but of datatype boolean**.
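These properties can be checked directly. The sketch below uses a hypothetical three-row frame named `toy`; `is_win` is an illustrative variable name.

```python
import pandas as pd

toy = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Result": ["loss", "win", "win"],
})

is_win = toy["Result"] == "win"

print(is_win.tolist())                 # [False, True, True]
print(is_win.name)                     # Result (same name as the source column)
print(is_win.index.equals(toy.index))  # True (same index as the source column)
print(toy[is_win])                     # only the rows where is_win is True
```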

These boolean Series can be used as an argument to the `[]` operator.

Create a DataFrame of all election winners since 1980.

In [36]:
#Answer Here
df_win=data.Result=="win"
since_1980=data.Year>1979
winners=data[df_win & since_1980].head()
# winners=data[(data.Result=="win") & (data.Year>1979)].head()
winners

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
131,1980,Ronald Reagan,Republican,43903230,win,50.897944
133,1984,Ronald Reagan,Republican,54455472,win,59.023326
135,1988,George H. W. Bush,Republican,48886597,win,53.518845
140,1992,Bill Clinton,Democratic,44909806,win,43.118485
144,1996,Bill Clinton,Democratic,47400125,win,49.296938


Above, we've assigned the results of the comparison operators to new variables called `df_win` and `since_1980`. This is uncommon. Usually, the series is created and used on the same line. Such code is a little tricky to read at first, but you'll get used to it quickly.

Show all 'win' results strictly between 1980 and 2000 (i.e., the elections of 1984 through 1996)

In [37]:
#Answer Here
winners=data[(data.Result=="win") & (data.Year>1980) & (data.Year<2000)].head()
winners

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
133,1984,Ronald Reagan,Republican,54455472,win,59.023326
135,1988,George H. W. Bush,Republican,48886597,win,53.518845
140,1992,Bill Clinton,Democratic,44909806,win,43.118485
144,1996,Bill Clinton,Democratic,47400125,win,49.296938


Show all 'loss' results for the Independent party

In [38]:
# Answer Here
loss_ind=data[(data.Party=="Independent") & (data.Result=="loss")].head()
loss_ind

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
121,1976,Eugene McCarthy,Independent,740460,loss,0.911649
130,1980,John B. Anderson,Independent,5719850,loss,6.631143
143,1992,Ross Perot,Independent,19743821,loss,18.956298
161,2004,Ralph Nader,Independent,465151,loss,0.380663
167,2008,Ralph Nader,Independent,739034,loss,0.563842


We can select multiple criteria by creating multiple boolean Series and combining them using the `&` operator.
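A sketch of combining conditions, on a hypothetical four-row frame named `toy` built from rows shown in this notebook. Each comparison is wrapped in parentheses because `&` binds more tightly than `==` and `>` in Python. The `|` (or) and `~` (not) operators are not used elsewhere in this notebook; they are included here as the standard element-wise counterparts of `&`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1980, 1980, 1992, 1992],
    "Candidate": ["Ronald Reagan", "John B. Anderson", "Bill Clinton", "Ross Perot"],
    "Result": ["win", "loss", "win", "loss"],
})

both = toy[(toy["Result"] == "win") & (toy["Year"] > 1980)]    # win AND after 1980
either = toy[(toy["Result"] == "win") | (toy["Year"] > 1980)]  # win OR after 1980
negated = toy[~(toy["Result"] == "win")]                       # NOT win

print(both["Candidate"].tolist())     # ['Bill Clinton']
print(either["Candidate"].tolist())   # ['Ronald Reagan', 'Bill Clinton', 'Ross Perot']
print(negated["Candidate"].tolist())  # ['John B. Anderson', 'Ross Perot']
```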

Show the 'win' results where the winner's vote percentage was less than 50%

In [39]:
# Answer Here
res=data.Result=="win"
perc=data["%"] < 50
win_less_50=data[res & perc]
win_less_50

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
16,1848,Zachary Taylor,Whig,1360235,win,47.309296
20,1856,James Buchanan,Democratic,1835140,win,45.30608
23,1860,Abraham Lincoln,Republican,1855993,win,39.699408
33,1876,Rutherford Hayes,Republican,4034142,win,48.471624
36,1880,James Garfield,Republican,4453337,win,48.369234
39,1884,Grover Cleveland,Democratic,4914482,win,48.884933
43,1888,Benjamin Harrison,Republican,5443633,win,47.858041
47,1892,Grover Cleveland,Democratic,5553898,win,46.121393
70,1912,Woodrow Wilson,Democratic,6296284,win,41.933422


Show all 'win' results strictly between 1980 and 2000 (this time without `head`)

In [40]:
# Answer Here
res=data[(data.Result=="win") & (data.Year>1980) & (data.Year<2000)]
res

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
133,1984,Ronald Reagan,Republican,54455472,win,59.023326
135,1988,George H. W. Bush,Republican,48886597,win,53.518845
140,1992,Bill Clinton,Democratic,44909806,win,43.118485
144,1996,Bill Clinton,Democratic,47400125,win,49.296938


## loc and iloc

Show the first 5 entries

In [41]:
# Answer Here
data.loc[:4]
data.head(5)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789


You can provide `.loc` a list of row labels `[0, 1, 2, 3, 4]` and a list of column labels `['Candidate', 'Party', 'Year']` as input to return a dataframe

In [42]:
#Answer Here
data.loc[[0,1,2,3,4],["Candidate","Party","Year"]]

Unnamed: 0,Candidate,Party,Year
0,Andrew Jackson,Democratic-Republican,1824
1,John Quincy Adams,Democratic-Republican,1824
2,Andrew Jackson,Democratic,1828
3,John Quincy Adams,National Republican,1828
4,Andrew Jackson,Democratic,1832


`loc` also supports **slicing** (for all types, including numeric and string labels!). Note that the slicing for `loc` is **inclusive**, even for numeric slices.
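A minimal sketch of inclusive label slicing, using a hypothetical four-row frame named `toy`. For contrast, the plain `[]` row slice on the last line follows ordinary Python rules and excludes the endpoint.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson", "John Quincy Adams"],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic", "National Republican"],
    "Result": ["loss", "win", "win", "loss"],
})

rows = toy.loc[1:2]                     # labels 1 AND 2 are both included
cols = toy.loc[:, "Candidate":"Party"]  # both endpoint columns are included

print(rows.index.tolist())    # [1, 2]
print(cols.columns.tolist())  # ['Candidate', 'Party']
print(toy[1:2].index.tolist())  # [1] -> plain [] slicing is exclusive
```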

Use Slicing on Rows and Columns

In [43]:
# Answer Here
data.loc[0:,"Candidate":"%"]

Unnamed: 0,Candidate,Party,Popular vote,Result,%
0,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,Andrew Jackson,Democratic,642806,win,56.203927
3,John Quincy Adams,National Republican,500897,loss,43.796073
4,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...
177,Jill Stein,Green,1457226,loss,1.073699
178,Joseph Biden,Democratic,81268924,win,51.311515
179,Donald Trump,Republican,74216154,loss,46.858542
180,Jo Jorgensen,Libertarian,1865724,loss,1.177979


If we provide only a **single label** for the column argument, we get back a **Series**.

In [44]:
# Answer Here
str1=data.loc[:,"Candidate"]
str1

Unnamed: 0,Candidate
0,Andrew Jackson
1,John Quincy Adams
2,Andrew Jackson
3,John Quincy Adams
4,Andrew Jackson
...,...
177,Jill Stein
178,Joseph Biden
179,Donald Trump
180,Jo Jorgensen


If we want a data frame instead and don't want to use to_frame, we can provide a **list** containing the column name.

In [45]:
# Answer Here
data[["Candidate"]]

Unnamed: 0,Candidate
0,Andrew Jackson
1,John Quincy Adams
2,Andrew Jackson
3,John Quincy Adams
4,Andrew Jackson
...,...
177,Jill Stein
178,Joseph Biden
179,Donald Trump
180,Jo Jorgensen


If we give only one row but many column labels, we'll get back a **Series** corresponding to a row of the table. This new Series has a neat index, where **each entry is the name of the column** that the data came from.
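A sketch of what such a row Series looks like, on a hypothetical two-row frame named `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1824, 1824],
    "Candidate": ["Andrew Jackson", "John Quincy Adams"],
    "Result": ["loss", "win"],
})

row = toy.loc[1]            # one row label, all columns

print(type(row).__name__)   # Series
print(row.index.tolist())   # ['Year', 'Candidate', 'Result'] -> the column names
print(row.name)             # 1 -> the row label becomes the Series name
print(row["Candidate"])     # John Quincy Adams
```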

In [46]:
# Answer Here
data.loc[1,:]

Unnamed: 0,1
Year,1824
Candidate,John Quincy Adams
Party,Democratic-Republican
Popular vote,113142
Result,win
%,42.789878


In [47]:
# Answer Here

If we omit the column argument altogether, the **default behavior is to retrieve all columns**.

In [48]:
# Answer Here
data.loc[1]

Unnamed: 0,1
Year,1824
Candidate,John Quincy Adams
Party,Democratic-Republican
Popular vote,113142
Result,win
%,42.789878


Specify rows and columns as lists to retrieve specific entries

In [49]:
# Answer Here
data.loc[[1,2,3,4],["Candidate","Year"]]

Unnamed: 0,Candidate,Year
1,John Quincy Adams,1824
2,Andrew Jackson,1828
3,John Quincy Adams,1828
4,Andrew Jackson,1832


Boolean Series are also boolean arrays, so we can use the Boolean Array Selection from earlier using loc as well.

In [50]:
# Answer Here
ser=data.Result=="win"
data.loc[ser]
data.loc[data.Result=="win",["Candidate","Party"]].head()

Unnamed: 0,Candidate,Party
1,John Quincy Adams,Democratic-Republican
2,Andrew Jackson,Democratic
4,Andrew Jackson,Democratic
8,Martin Van Buren,Democratic
11,William Henry Harrison,Whig


## String-labeled Rows

Let's do a quick example using data with string-labeled rows instead of integer labeled rows, just to make sure we're really understanding loc.

Use mottos.csv file

In [51]:
# Answer Here
df=pd.read_csv("/content/drive/MyDrive/DataScience_with_KS/mottos.csv",index_col="State")
df.loc[["Alabama","California"],["Motto","Language"]]
df.loc["Alabama":"California","Motto":"Language"]

Unnamed: 0_level_0,Motto,Translation,Language
State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Alabama,Audemus jura nostra defendere,We dare defend our rights!,Latin
Alaska,North to the future,—,English
Arizona,Ditat Deus,God enriches,Latin
Arkansas,Regnat populus,The people rule,Latin
California,Eureka (Εὕρηκα),I have found it,Greek


As shown above, a range of rows can be specified using slice notation, even if the rows have string labels instead of integer labels.

### iloc

loc's cousin iloc is very similar, but is used to access based on numerical position instead of label. For example, to access the top 3 rows and first 3 columns of a table, we can use `iloc[0:3, 0:3]`. `iloc` slicing is **exclusive**, just like standard Python slicing of numerical values.
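The difference between position and label becomes visible once the row labels no longer match the row positions, for example after filtering. This is a minimal sketch on a hypothetical four-row frame named `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828],
    "Result": ["loss", "win", "win", "loss"],
})

winners = toy[toy["Result"] == "win"]  # keeps the original row labels 1 and 2

by_position = winners.iloc[0:1]        # positions 0 up to (not including) 1
by_label = winners.loc[1:2]            # labels 1 through 2, inclusive

print(by_position.index.tolist())      # [1]
print(by_label.index.tolist())         # [1, 2]
print(toy.iloc[0:3, 0:1].shape)        # (3, 1): three rows, one column
```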

Use iloc to extract the first 3 rows and the first 3 columns from the elections DataFrame

In [52]:
#Answer Here
data.iloc[:3,:3]

Unnamed: 0,Year,Candidate,Party
0,1824,Andrew Jackson,Democratic-Republican
1,1824,John Quincy Adams,Democratic-Republican
2,1828,Andrew Jackson,Democratic


We will use both `loc` and `iloc` in the course. `loc` is generally preferred for a number of reasons, for example:

1. It is harder to make mistakes since you have to literally write out what you want to get.
2. Code is easier to read, because the reader doesn't have to know e.g., what column #17 represents.
3. It is robust against permutations of the data, e.g., if the data provider switches the order of two columns.

However, iloc is sometimes more convenient. We'll provide examples of when iloc is the superior choice.
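To illustrate the third point in the list above, here is a sketch in which the columns of a hypothetical frame named `toy` are reordered. The label-based lookup is unaffected, while the position-based lookup silently returns a different column.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1824, 1828],
    "Candidate": ["Andrew Jackson", "Andrew Jackson"],
    "Party": ["Democratic-Republican", "Democratic"],
})

# Same data, different column order.
reordered = toy[["Party", "Year", "Candidate"]]

print(toy.loc[0, "Candidate"], "|", reordered.loc[0, "Candidate"])  # same value both times
print(toy.iloc[0, 1], "|", reordered.iloc[0, 1])                    # Andrew Jackson | 1824
```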

## Handy Properties and Utility Functions for Series and DataFrames

The `head` and `describe` methods and the `shape` and `size` attributes can be used to quickly get a good sense of the data we're working with. For example:

In [53]:
mottos = pd.read_csv("/content/drive/MyDrive/DataScience_with_KS/mottos.csv")

In [54]:
# Answer Here
print(mottos.shape)
mottos.head()

(50, 5)


Unnamed: 0,State,Motto,Translation,Language,Date Adopted
0,Alabama,Audemus jura nostra defendere,We dare defend our rights!,Latin,1923
1,Alaska,North to the future,—,English,1967
2,Arizona,Ditat Deus,God enriches,Latin,1863
3,Arkansas,Regnat populus,The people rule,Latin,1907
4,California,Eureka (Εὕρηκα),I have found it,Greek,1849


Size of DataFrame

In [55]:
# Answer Here
mottos.size

250

A size of 250 (50 rows × 5 columns) means our data file is relatively small, with only 250 total entries.
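The relationship between `shape` and `size` can be checked on any frame. Here is a sketch with a hypothetical three-row frame named `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "Language": ["Latin", "English", "Latin"],
})

n_rows, n_cols = toy.shape
print(toy.shape)                     # (3, 2)
print(toy.size)                      # 6
print(toy.size == n_rows * n_cols)   # True: size counts every cell
```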

Shape of DataFrame

In [56]:
# Answer Here
mottos.shape

(50, 5)

Use the `describe` method to extract meaningful information from the DataFrame

In [57]:
# Answer Here
stats=mottos.describe()
stats

Unnamed: 0,State,Motto,Translation,Language,Date Adopted
count,50,50,49,50,50
unique,50,50,30,8,47
top,Alabama,Audemus jura nostra defendere,—,Latin,1893
freq,1,1,20,23,2


Above, we see a quick summary of all the data. For example, the most common language for mottos is Latin, which covers 23 different states. Does anything else seem surprising?

We can get a direct reference to the index using .index.

In [58]:
stats.index[1]


'unique'

In [59]:

mottos.head(2)

Unnamed: 0,State,Motto,Translation,Language,Date Adopted
0,Alabama,Audemus jura nostra defendere,We dare defend our rights!,Latin,1923
1,Alaska,North to the future,—,English,1967


It turns out the columns also have an Index. We can access this index by using `.columns`.
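A small sketch showing that both the row labels and the column labels are `Index` objects. The frame `toy` is hypothetical and uses two rows from the mottos data.

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Motto": ["Audemus jura nostra defendere", "North to the future"],
    "Language": ["Latin", "English"],
}).set_index("State")

print(toy.index.tolist())               # ['Alabama', 'Alaska']
print(toy.columns.tolist())             # ['Motto', 'Language']
print(isinstance(toy.index, pd.Index))    # True
print(isinstance(toy.columns, pd.Index))  # True
```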

In [60]:
# Answer Here
mottos.columns[1]

'Motto'

## Sorting and Value Counts

There are also a ton of useful utility methods we can use with Data Frames and Series. For example, we can create a copy of a data frame sorted by a specific column using `sort_values`.

In [61]:
# Answer Here

As mentioned before, most DataFrame methods (including `sort_values`) return a copy and do **not** modify the original data structure, unless you set `inplace` to True.

If we want to sort in reverse order, we can set `ascending=False`.
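A sketch of both sort directions on a hypothetical three-row frame named `toy`. Note that the row labels travel with the rows, and that `toy` itself keeps its original order.

```python
import pandas as pd

toy = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "%": [57.210122, 42.789878, 56.203927],
})

ascending = toy.sort_values("%")                    # smallest first (the default)
descending = toy.sort_values("%", ascending=False)  # largest first

print(ascending.index.tolist())   # [1, 2, 0]
print(descending.index.tolist())  # [0, 2, 1]
print(toy.index.tolist())         # [0, 1, 2] -> original order is untouched
```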

In [62]:
data.sort_values('%', ascending=False)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
114,1964,Lyndon Johnson,Democratic,43127041,win,61.344703
91,1936,Franklin Roosevelt,Democratic,27752648,win,60.978107
120,1972,Richard Nixon,Republican,47168710,win,60.907806
79,1920,Warren Harding,Republican,16144093,win,60.574501
133,1984,Ronald Reagan,Republican,54455472,win,59.023326
...,...,...,...,...,...,...
165,2008,Cynthia McKinney,Green,161797,loss,0.123442
148,1996,John Hagelin,Natural Law,113670,loss,0.118219
160,2004,Michael Peroutka,Constitution,143630,loss,0.117542
141,1992,Bo Gritz,Populist,106152,loss,0.101918


We can also use `sort_values` on Series objects.

In [63]:
mottos['Language'].sort_values().head(50)

Unnamed: 0,Language
46,Chinook Jargon
49,English
29,English
28,English
27,English
26,English
48,English
37,English
38,English
40,English


For Series, the `value_counts` method is often quite handy.

In [64]:
mottos['Language'].value_counts()

Unnamed: 0_level_0,count
Language,Unnamed: 1_level_1
Latin,23
English,21
Greek,1
Hawaiian,1
Italian,1
French,1
Spanish,1
Chinook Jargon,1


Also commonly used is the `unique` method, which returns **all unique values** as a numpy array.
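To see how `unique` and `value_counts` relate, here is a sketch on a hypothetical six-element Series named `langs`:

```python
import pandas as pd

langs = pd.Series(
    ["Latin", "English", "Latin", "Greek", "Latin", "English"],
    name="Language",
)

counts = langs.value_counts()  # one entry per distinct value, most frequent first
distinct = langs.unique()      # distinct values in order of first appearance

print(counts.index.tolist())   # ['Latin', 'English', 'Greek']
print(counts.tolist())         # [3, 2, 1]
print(list(distinct))          # ['Latin', 'English', 'Greek']
print(len(distinct) == len(counts))  # True
```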

In [65]:
mottos['Language'].unique()

array(['Latin', 'English', 'Greek', 'Hawaiian', 'Italian', 'French',
       'Spanish', 'Chinook Jargon'], dtype=object)

In [66]:
def fiba(n):
    if n < 2:
        return n
    else:
        return fiba(n-1) + fiba(n-2)
fiba(5)

5

# Thank you!