# Introduction to Pandas 2

Advanced Pandas syntax, string methods, and sorting.

In [58]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

plt.style.use('fivethirtyeight')

## 1. Handy Utility Methods

**Example 1.1.** The `head` and `describe` methods, together with the `shape` and `size` attributes, give a quick sense of the data we're working with.

In [59]:
mottos = pd.read_csv("mottos.csv", index_col = "State")

In [60]:
mottos.head()

State,Motto,Translation,Language,Date Adopted
Alabama,Audemus jura nostra defendere,We dare defend our rights!,Latin,1923
Alaska,North to the future,—,English,1967
Arizona,Ditat Deus,God enriches,Latin,1863
Arkansas,Regnat populus,The people rule,Latin,1907
California,Eureka (Εὕρηκα),I have found it,Greek,1849


In [29]:
mottos.shape

(50, 4)

In [24]:
mottos.size

200

In [33]:
mottos.describe()

Unnamed: 0,Motto,Translation,Language,Date Adopted
count,50,49,50,50
unique,50,30,8,47
top,Qui transtulit sustinet,—,Latin,1893
freq,1,20,23,2


**Note:** The table above summarizes every column. For example, the most common motto language is Latin, which is used by 23 states. Does anything else seem surprising?
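As a small self-contained sketch of how these utilities relate to one another, the cell below uses a hypothetical three-row table (not the real `mottos.csv`). It assumes only that all columns hold strings, in which case `describe` reports `count`, `unique`, `top`, and `freq`.

```python
import pandas as pd

# Hypothetical toy data standing in for mottos.csv
toy = pd.DataFrame(
    {
        "Motto": ["Ditat Deus", "Eureka", "Excelsior"],
        "Language": ["Latin", "Greek", "Latin"],
    },
    index=pd.Index(["Arizona", "California", "New York"], name="State"),
)

n_rows, n_cols = toy.shape   # (number of rows, number of columns)
total_cells = toy.size       # rows * columns

# For string columns, describe() reports count, unique, top, and freq
summary = toy.describe()

print(n_rows, n_cols, total_cells)
print(summary)
```

Here `size` is simply the product of the two entries of `shape`, which matches the `(50, 4)` and `200` shown above.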

**Example 1.2.** We can get a direct reference to the index using `.index`.

In [35]:
mottos.index

Index(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'],
      dtype='object', name='State')

In [36]:
mottos.index.name

'State'

**Example 1.3.** It turns out the columns also have an Index. We can access this index by using `.columns`.

In [37]:
mottos.columns

Index(['Motto', 'Translation', 'Language', 'Date Adopted'], dtype='object')
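To make the parallel between row labels and column labels concrete, here is a minimal sketch on a hypothetical two-column table. It only checks that both `.index` and `.columns` are `pd.Index` objects.

```python
import pandas as pd

# Hypothetical toy data; the real mottos table has more rows and columns
toy = pd.DataFrame(
    {"Motto": ["Ditat Deus", "Eureka"], "Language": ["Latin", "Greek"]},
    index=pd.Index(["Arizona", "California"], name="State"),
)

row_labels = toy.index      # Index of row labels, named "State"
col_labels = toy.columns    # Index of column labels

print(isinstance(row_labels, pd.Index), isinstance(col_labels, pd.Index))
print(row_labels.name, list(col_labels))
```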

**Example 1.4.** We can also use `.sort_values` on Series objects.

In [46]:
mottos['Language'].sort_values(ascending = False).head()

State
Montana          Spanish
Maine              Latin
West Virginia      Latin
Virginia           Latin
Vermont            Latin
Name: Language, dtype: object
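A minimal sketch of the same call on a hypothetical four-element Series is shown below. Note that `sort_values` returns a new Series and leaves the original untouched, and that strings are sorted alphabetically.

```python
import pandas as pd

# Hypothetical toy Series standing in for mottos["Language"]
languages = pd.Series(
    ["Latin", "English", "Spanish", "Latin"],
    index=["Maine", "Alaska", "Montana", "Virginia"],
    name="Language",
)

# Reverse alphabetical order; the index labels travel with their values
descending = languages.sort_values(ascending=False)

print(descending)
```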

**Example 1.5.** For series, the `.value_counts` method is often quite handy.

In [18]:
mottos['Language'].value_counts()

Latin             23
English           21
Spanish            1
Greek              1
Italian            1
Hawaiian           1
Chinook Jargon     1
French             1
Name: Language, dtype: int64
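The sketch below runs `value_counts` on a hypothetical six-element Series. It also shows the `normalize=True` option, which returns proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical toy Series standing in for mottos["Language"]
languages = pd.Series(
    ["Latin", "English", "Latin", "Greek", "English", "Latin"],
    name="Language",
)

counts = languages.value_counts()                    # sorted from most to least frequent
proportions = languages.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(proportions)
```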

**Example 1.6.** Also commonly used is the `.unique` method, which returns all unique values as an array.

In [19]:
mottos['Language'].unique()

array(['Latin', 'English', 'Greek', 'Hawaiian', 'Italian', 'French',
       'Spanish', 'Chinook Jargon'], dtype=object)

**Example 1.7.** Store the languages in a list.

In [None]:
motto_languages = ...
motto_languages
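One possible approach, sketched on a hypothetical toy Series rather than the real data, is to wrap the result of `.unique` in `list`. The variable names here are illustrative only.

```python
import pandas as pd

# Hypothetical toy Series standing in for mottos["Language"]
languages = pd.Series(["Latin", "English", "Latin", "Greek"], name="Language")

# unique() keeps values in order of first appearance; list() converts to a Python list
toy_languages = list(languages.unique())

print(toy_languages)
```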

## 2. String Methods

Let's load the North Carolina `babynames` dataframe.

In [48]:
babynames = pd.read_csv("babynames_nc.csv", index_col = 0)
babynames.head()

Unnamed: 0,State,Sex,Year,Name,Count
0,NC,F,1910,Mary,837
1,NC,F,1910,Annie,401
2,NC,F,1910,Ruth,235
3,NC,F,1910,Ethel,199
4,NC,F,1910,Elizabeth,191


**Example 2.1.** Find the most popular baby name in North Carolina in 2003.

In [72]:
...

<bound method DataFrame.sort_values of             Name  Count
68806    Madison    696
68807       Emma    687
68808      Emily    662
68809     Hannah    603
68810    Abigail    515
...          ...    ...
155424    Unique      5
155425   Vicente      5
155426  Wilfredo      5
155427    Yousef      5
155428    Zakary      5

[2447 rows x 2 columns]>
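The `<bound method DataFrame.sort_values of ...>` text in the output above is what Python displays when a method is referenced without parentheses. The sketch below, on hypothetical toy data, contrasts referencing the method with actually calling it.

```python
import pandas as pd

# Hypothetical toy data standing in for babynames
toy = pd.DataFrame(
    {
        "Year": [2003, 2003, 2003, 2004],
        "Name": ["Emma", "Madison", "Emily", "Emma"],
        "Count": [687, 696, 662, 700],
    }
)

in_2003 = toy[toy["Year"] == 2003]

# Without parentheses we get the method object itself, not a sorted DataFrame
not_called = in_2003.sort_values

# Calling the method performs the sort
called = in_2003.sort_values("Count", ascending=False)
top_name = called.iloc[0]["Name"]

print(callable(not_called))
print(top_name)
```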

**Example 2.2.** Use the series method for strings `.str.startswith` to find baby names that start with "K".

In [74]:
...

Unnamed: 0,State,Sex,Year,Name,Count
60,NC,F,1910,Katie,69
103,NC,F,1910,Katherine,33
126,NC,F,1910,Kate,24
136,NC,F,1910,Kathleen,20
248,NC,F,1910,Kathryn,8
...,...,...,...,...,...
175506,NC,M,2019,Kyan,5
175507,NC,M,2019,Kyngston,5
175508,NC,M,2019,Kyon,5
175509,NC,M,2019,Kyren,5


In [26]:
k = ...
k.head()

Unnamed: 0,State,Sex,Year,Name,Count
46,NC,F,1910,Jessie,85
51,NC,F,1910,Josephine,78
58,NC,F,1910,Julia,70
87,NC,F,1910,Janie,43
123,NC,F,1910,Jennie,25
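As a sketch of the general pattern, `.str.startswith` returns a boolean Series that can be used as a row filter. The toy data and variable names below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for babynames
names = pd.DataFrame(
    {"Name": ["Katie", "Julia", "Kyan", "Mary"], "Count": [69, 70, 5, 837]}
)

starts_with_k = names["Name"].str.startswith("K")  # True where the name begins with "K"
k_names = names[starts_with_k]                     # keep only the matching rows

print(k_names)
```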


**Example 2.3.** Find out how many babies were born in North Carolina with your name in the year you were born.

In [76]:
...

Unnamed: 0,State,Sex,Year,Name,Count
130006,NC,M,1970,Gary,333


**Example 2.4.** There are many string methods. Click [here](https://pandas.pydata.org/pandas-docs/version/0.23.4/text.html#concatenating-a-single-series-into-a-string) to view the documentation.

In [27]:
babynames[babynames["Name"].str.contains("ad")].head()

Unnamed: 0,State,Sex,Year,Name,Count
22,NC,F,1910,Gladys,116
62,NC,F,1910,Sadie,68
218,NC,F,1910,Madeline,10
267,NC,F,1910,Madge,7
294,NC,F,1910,Madie,6


In [77]:
babynames["Name"].str.split("ar")

0              [M, y]
1             [Annie]
2              [Ruth]
3             [Ethel]
4         [Elizabeth]
             ...     
175572      [Woodrow]
175573         [Wren]
175574         [Yair]
175575         [Yoel]
175576         [Zeus]
Name: Name, Length: 175577, dtype: object
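The two calls above can be reproduced on a hypothetical four-name Series to see exactly what they return: `.str.contains` gives one boolean per name, while `.str.split` gives one list of pieces per name.

```python
import pandas as pd

# Hypothetical toy Series standing in for babynames["Name"]
names = pd.Series(["Mary", "Gladys", "Sadie", "Ruth"], name="Name")

# True where "ad" appears anywhere in the name (the pattern is a regex by default)
has_ad = names.str.contains("ad")

# Names without "ar" come back as a single-element list
pieces = names.str.split("ar")

print(has_ad)
print(pieces)
```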

**Example 2.5.** Write a line of code that creates a list (or Series or array) of all names that end with “ert”.

In [61]:
...

array(['Robert', 'Margert', 'Herbert', 'Albert', 'Hubert', 'Elbert',
       'Wilbert', 'Gilbert', 'Bert', 'Thelbert', 'Rupert', 'Hobert',
       'Delbert', 'Talbert', 'Halbert', 'Hilbert', 'Milbert', 'Hebert',
       'Hurbert', 'Norbert'], dtype=object)
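One way this could be done, sketched on hypothetical toy data, is to filter with `.str.endswith` and then remove repeats with `.unique`. Other approaches, such as `.str.contains` with a regular expression, would also work.

```python
import pandas as pd

# Hypothetical toy Series standing in for babynames["Name"]
names = pd.Series(["Robert", "Mary", "Albert", "Robert", "Bert"], name="Name")

# Keep names ending in "ert", then drop repeated names
ert_names = names[names.str.endswith("ert")].unique()

print(list(ert_names))
```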

## 3. Sort Names by Length

Suppose we want to sort all baby names in North Carolina by their length.

**Example 3.1.** Create a new series of only the lengths.

In [78]:
babyname_lengths = babynames["Name"].str.len()

**Example 3.2.** Then add that series to the dataframe as a column.

In [79]:
babynames["name_lengths"] = babyname_lengths
babynames.head(5)

Unnamed: 0,State,Sex,Year,Name,Count,name_lengths
0,NC,F,1910,Mary,837,4
1,NC,F,1910,Annie,401,5
2,NC,F,1910,Ruth,235,4
3,NC,F,1910,Ethel,199,5
4,NC,F,1910,Elizabeth,191,9


**Example 3.3.** Then sort by that column.

In [80]:
babynames = babynames.sort_values(by = "name_lengths", ascending = False)
babynames.head()

Unnamed: 0,State,Sex,Year,Name,Count,name_lengths
67383,NC,F,2001,Maryelizabeth,5,13
71259,NC,F,2004,Marykatherine,7,13
63421,NC,F,1998,Maryelizabeth,5,13
97095,NC,M,1914,Christopher,10,11
149299,NC,M,1998,Christopher,971,11


**Example 3.3.** Then drop that column.

In [81]:
babynames = babynames.drop("name_lengths", axis = 'columns')
babynames.head()

Unnamed: 0,State,Sex,Year,Name,Count
67383,NC,F,2001,Maryelizabeth,5
71259,NC,F,2004,Marykatherine,7
63421,NC,F,1998,Maryelizabeth,5
97095,NC,M,1914,Christopher,10
149299,NC,M,1998,Christopher,971
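The add-sort-drop sequence above can be condensed into one chained expression. The sketch below uses hypothetical toy data and also shows an alternative that assumes pandas 1.1 or later, where `sort_values` accepts a `key` function.

```python
import pandas as pd

# Hypothetical toy data standing in for babynames (all name lengths are distinct)
names = pd.DataFrame(
    {"Name": ["Al", "Christopher", "Mary", "Ethel"], "Count": [7, 971, 837, 199]}
)

# Temporary column: add it, sort by it, then drop it again
sorted_names = (
    names.assign(name_lengths=names["Name"].str.len())
    .sort_values("name_lengths", ascending=False)
    .drop(columns="name_lengths")
)

# Alternative (assumes pandas >= 1.1): sort by a key computed on the fly
sorted_by_key = names.sort_values(
    "Name", key=lambda s: s.str.len(), ascending=False
)

print(sorted_names)
```

Because `assign` returns a new DataFrame, the original `names` never gains the temporary column.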


In [90]:
babynames["Count"].describe()

count    175577.000000
mean         51.372503
std         142.605256
min           5.000000
25%           7.000000
50%          13.000000
75%          36.000000
max        3883.000000
Name: Count, dtype: float64

**Example 3.4.** Which name is the least popular?

In [103]:
...

67383     Maryelizabeth
63421     Maryelizabeth
14888       Wilhelmenia
96790       Christopher
158716      Constantine
              ...      
140958               Bo
140951               Al
145672               Bo
145753               Ty
124645               Ed
Name: Name, Length: 22558, dtype: object

**Example 3.5.** Drop the `dr_ea_count` column.

## 4. Generate a Sorted Index

**Example 4.1.** Let's start over by first scrambling the order of babynames.

In [75]:
babynames = babynames.sample(frac = 1)
babynames.head(5)

Unnamed: 0,State,Sex,Year,Name,Count,dr_ea_count
51378,NC,F,1987,Keri,31,0
57799,NC,F,1993,Darlene,6,0
97324,NC,M,1915,Carl,177,0
165197,NC,M,2012,Aiden,451,0
141421,NC,M,1988,Karl,11,0
...,...,...,...,...,...,...
71558,NC,F,2004,Keonna,5,0
159899,NC,M,2007,Kyan,6,0
132731,NC,M,1974,Ira,11,0
171055,NC,M,2016,Brenton,9,0


**Example 4.2.** Another approach is to take advantage of the fact that `.loc` can accept an index. That is:

- `df.loc[idx]` returns df with its rows in the same order as the given index.
- Only works if the index exactly matches the DataFrame.
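The two points above can be checked on a hypothetical three-row table: passing a reordered index to `.loc` reorders the rows, and a label that does not exist in the DataFrame raises a `KeyError` (this assumes pandas 1.0 or later).

```python
import pandas as pd

# Hypothetical toy data with integer row labels
df = pd.DataFrame({"Name": ["Mary", "Al", "Ethel"]}, index=[10, 20, 30])

# Rows come back in the order given by the index we pass in
new_order = pd.Index([20, 30, 10])
reordered = df.loc[new_order]

# A label missing from df.index is an error rather than a silent NaN row
try:
    df.loc[[20, 99]]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

print(reordered)
print(missing_label_raised)
```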

The first step is to create a Series containing the length of each name.

In [67]:
name_lengths = babynames["Name"].str.len()
name_lengths.head(5)

88470     5
40904     6
122333    7
9871      4
95625     5
Name: Name, dtype: int64

**Example 4.3.** The next step is to sort the new series we just created.

In [68]:
name_lengths_sorted_by_length = name_lengths.sort_values()
name_lengths_sorted_by_length.head(5)

158709    2
104984    2
126376    2
29314     2
115312    2
Name: Name, dtype: int64

**Example 4.4.** Next, we pass the index of the sorted series to the `loc` method of the original dataframe.

In [69]:
index_sorted_by_length = name_lengths_sorted_by_length.index
index_sorted_by_length

Int64Index([158709, 104984, 126376,  29314, 115312, 168446, 106702,  15528,
             22167, 168159,
            ...
             51460,  96790, 128827, 158836, 117595,  21988, 141734,  67383,
             71259,  63421],
           dtype='int64', length=175577)

In [70]:
babynames.loc[index_sorted_by_length].head()

Unnamed: 0,State,Sex,Year,Name,Count,dr_ea_count
158709,NC,M,2006,Bo,5,0
104984,NC,M,1927,Al,7,0
126376,NC,M,1963,Ed,7,0
29314,NC,F,1959,Jo,116,0
115312,NC,M,1945,Al,11,0


**Example 4.5.** See if you can do the previous steps all in one line.

In [71]:
...

Unnamed: 0,State,Sex,Year,Name,Count,dr_ea_count
158709,NC,M,2006,Bo,5,0
104984,NC,M,1927,Al,7,0
126376,NC,M,1963,Ed,7,0
29314,NC,F,1959,Jo,116,0
115312,NC,M,1945,Al,11,0
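One possible one-line version, sketched on hypothetical toy data, chains the three steps: compute lengths, sort them, and pass the resulting index to `.loc`. The `random_state` value is an arbitrary choice made only so the scramble is reproducible.

```python
import pandas as pd

# Hypothetical toy data standing in for babynames (all name lengths are distinct)
names = pd.DataFrame({"Name": ["Mary", "Al", "Christopher", "Ethel"]})

# Scramble the rows first, as in Example 4.1
scrambled = names.sample(frac=1, random_state=0)

# Lengths -> sorted lengths -> their index -> rows in that order
by_length = scrambled.loc[scrambled["Name"].str.len().sort_values().index]

print(by_length)
```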
