# Extra Content

In [1]:
import pandas as pd
import numpy as np

# Python: One More Data Structure  

## Collections Data Structures (standard):


Data Structure| Desc
----|------
Lists| Heterogeneous, mutable, **ordered** sequence of elements
Tuples| Heterogeneous, **immutable, ordered** sequence of elements
Dictionaries| Collection of **key-value** pairs (insertion-ordered since Python 3.7)
Set| Unordered collection of **unique** elements

## Collections Data Structures (Additional):

Data Structure| Desc
----|------
NumPy Arrays| Homogeneous N-dimensional array of elements (supports array and matrix operations)
Pandas Series| One-dimensional **labeled** (indexed) array
Pandas DataFrame| Two-dimensional table with a row index and column labels



# Set - Unordered collections of unique elements

Revisiting Lists

In [2]:
# Example of a list
names = ["Melvin", "Jack", "Smith", "Susan", "Samantha", "Mary", "Smith", "Melvin"]
for person in names:
    print(person, end=" ")

Melvin Jack Smith Susan Samantha Mary Smith Melvin 

When we convert the `names` list to a `set` we notice that duplicate values are removed.

In [3]:
# Convert a list to a set with set()
name_set = set(names)
print(name_set)

{'Jack', 'Samantha', 'Melvin', 'Smith', 'Susan', 'Mary'}
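Note that the printed order above differs from the order of the original list: a set does not remember insertion order. As a minimal sketch (the variable names `unique_unordered` and `unique_ordered` are illustrative, not part of the original notebook), one common way to drop duplicates while keeping first-seen order is `dict.fromkeys`, which relies on dictionaries preserving insertion order in Python 3.7+:

```python
names = ["Melvin", "Jack", "Smith", "Susan", "Samantha", "Mary", "Smith", "Melvin"]

# A set drops duplicates but makes no promise about element order
unique_unordered = set(names)

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates while preserving first-seen order
unique_ordered = list(dict.fromkeys(names))

print(len(unique_unordered))  # 6
print(unique_ordered)         # ['Melvin', 'Jack', 'Smith', 'Susan', 'Samantha', 'Mary']
```

Both results contain the same six names; only the second has a predictable order.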


#### More `set` examples
The next example uses two lists, `class_data` and `class_coding`. `class_data` contains duplicate names.

In [4]:
class_data = ["Melvin", "Jack", "Smith", "Susan", "Samantha", "Mary", "Smith", "Melvin"]
class_coding = ["Jack", "Smith", "Ali", "Anish", "Hilary", "Cesar"]

print(f"number of students in Class Data = {len(class_data)}\nnumber of students in Class Coding = {len(class_coding)}")

number of students in Class Data = 8
number of students in Class Coding = 6


Casting the two lists to sets using `set()`

In [6]:
class_data = set(class_data)
class_coding = set(class_coding)

print(f"number of students in Class Data = {len(class_data)}\nnumber of students in Class Coding = {len(class_coding)}")


number of students in Class Data = 6
number of students in Class Coding = 6


### Set Operations: `union`, `intersection` and `difference`

In [7]:
combine_unique = set.union(class_data, class_coding)
combine_unique

{'Ali',
 'Anish',
 'Cesar',
 'Hilary',
 'Jack',
 'Mary',
 'Melvin',
 'Samantha',
 'Smith',
 'Susan'}

In [8]:
class_data.union(class_coding)

{'Ali',
 'Anish',
 'Cesar',
 'Hilary',
 'Jack',
 'Mary',
 'Melvin',
 'Samantha',
 'Smith',
 'Susan'}

In [9]:
class_intersection = set.intersection(class_data, class_coding)
class_intersection

{'Jack', 'Smith'}

In [10]:
class_data.intersection(class_coding)

{'Jack', 'Smith'}

In [11]:
class_data.difference(class_coding)

{'Mary', 'Melvin', 'Samantha', 'Susan'}
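The same operations are also available as operators. The sketch below rebuilds the two sets so it runs on its own; the result names (`in_either`, `in_both`, and so on) are illustrative. It also shows that `difference` depends on which set comes first, and adds the symmetric difference `^` for names that appear in exactly one class.

```python
class_data = {"Melvin", "Jack", "Smith", "Susan", "Samantha", "Mary"}
class_coding = {"Jack", "Smith", "Ali", "Anish", "Hilary", "Cesar"}

in_either = class_data | class_coding       # same as class_data.union(class_coding)
in_both = class_data & class_coding         # same as class_data.intersection(class_coding)
only_data = class_data - class_coding       # same as class_data.difference(class_coding)
only_coding = class_coding - class_data     # difference is not symmetric
in_exactly_one = class_data ^ class_coding  # symmetric difference

# sorted() gives a stable, readable order for printing
print(sorted(in_both))      # ['Jack', 'Smith']
print(sorted(only_data))    # ['Mary', 'Melvin', 'Samantha', 'Susan']
print(sorted(only_coding))  # ['Ali', 'Anish', 'Cesar', 'Hilary']
```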

## Unpacking
#### Examples

In [12]:
# with Lists
first_name, last_name = ["Jack", "Smith"]
print(first_name)
print(last_name)

Jack
Smith


In [13]:
# with Tuples
first_name, last_name = ("Jack", "Smith")
print(first_name)
print(last_name)

Jack
Smith


In [14]:
x, y = 23, 50
x

23

In [15]:
grade, name = [23, "smith"]
grade

23

In [19]:
store, sales = ["0012", [23,45,6,19,90]]

print(f"{store} store had a total sale of ${sum(sales)}")

0012 store had a total sale of $183


In [20]:
def multiply(x,y):
    """
    x: numeric value
    y: numeric value
    returns x * y
    """
    return x*y

multiply(3,4)

12

The function `multiply` expects two arguments (one per parameter). If we pass it a single list, Python binds the whole list to `x`, leaves `y` unfilled, and raises a `TypeError`.

In [21]:
multiply([5,6])

TypeError: multiply() missing 1 required positional argument: 'y'

In [22]:
multiply(*[5,6])

30

**Note:** the `*` operator unpacks the list into its component values, so `multiply(*[5,6])` is equivalent to `multiply(5, 6)`. In the next cell, `multiply` is redefined to expect three arguments, and `*` unpacks the three-element list into `x`, `y` and `z`.

In [23]:
def multiply(x,y,z):
    return x*y*z

multiply(*[2,3,4])

24
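Two related forms of unpacking are worth knowing. This is a minimal sketch with illustrative data: a starred name on the left side of an assignment collects the leftover items into a list, and `**` unpacks a dictionary into keyword arguments. When the number of names and values does not match, Python raises a `ValueError`.

```python
def multiply(x, y, z):
    return x * y * z

# A starred target collects whatever is left over into a list
store, *sales = ["0012", 23, 45, 6, 19, 90]
print(store, sum(sales))  # 0012 183

# ** unpacks a dict into keyword arguments (keys must match parameter names)
factors = {"x": 2, "y": 3, "z": 4}
print(multiply(**factors))  # 24

# Too many values for the names on the left raises ValueError
try:
    first_name, last_name = ["Jack", "Smith", "Jr."]
except ValueError as err:
    print(f"ValueError: {err}")
```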

## `zip()` Function
The `zip()` function returns a zip object: an iterator of tuples in which the first items of the passed iterables are paired together, then the second items, and so on. Iteration stops when the shortest iterable is exhausted.

In [31]:
students = ["James", "Smith", "Mark", "Mike", "Justing"]
score = [98,100,80,79,88]
grade = ["A+", "A+", "B-", "C+", "B+"]

class_combined = list(zip(students, score, grade))

print(class_combined)

[('James', 98, 'A+'), ('Smith', 100, 'A+'), ('Mark', 80, 'B-'), ('Mike', 79, 'C+'), ('Justing', 88, 'B+')]


In [25]:
class_combined[0]

('James', 98, 'A+')
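Two behaviours of `zip()` are easy to miss. The sketch below uses deliberately shortened, illustrative lists: `zip()` silently stops at the shortest input, `itertools.zip_longest` pads instead, and `zip(*pairs)` reverses the pairing ("unzips") by combining `zip()` with `*` unpacking.

```python
from itertools import zip_longest

students = ["James", "Smith", "Mark"]
score = [98, 100]  # one score is missing on purpose

# zip() stops at the shortest input, so "Mark" is dropped
print(list(zip(students, score)))
# [('James', 98), ('Smith', 100)]

# zip_longest() pads the shorter input with fillvalue instead
print(list(zip_longest(students, score, fillvalue=None)))
# [('James', 98), ('Smith', 100), ('Mark', None)]

# zip(*pairs) turns a list of pairs back into separate tuples
pairs = [("James", 98), ("Smith", 100)]
names_back, scores_back = zip(*pairs)
print(names_back)   # ('James', 'Smith')
print(scores_back)  # (98, 100)
```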

**The zipped list of tuples can be converted to a Pandas DataFrame**

In [26]:
df = pd.DataFrame(class_combined)
df

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | James | 98 | A+ |
| 1 | Smith | 100 | A+ |
| 2 | Mark | 80 | B- |
| 3 | Mike | 79 | C+ |
| 4 | Justing | 88 | B+ |


In [27]:
df = pd.DataFrame(class_combined, columns=["Name", "Score", "Grade"])
df

| | Name | Score | Grade |
|---|---|---|---|
| 0 | James | 98 | A+ |
| 1 | Smith | 100 | A+ |
| 2 | Mark | 80 | B- |
| 3 | Mike | 79 | C+ |
| 4 | Justing | 88 | B+ |


# For Loops One More Time: Looping with `enumerate()`

In [32]:
# To get index and value from a list we can use enumerate
names = ["James", "Smith", "Mark", "Mike", "Justing"]

for i, name in enumerate(names):
    print(i, name)

0 James
1 Smith
2 Mark
3 Mike
4 Justing


In [33]:
# We can change the initial starting point for enumerate; the default is zero
for i, name in enumerate(names, 1):
    print(i, name)

1 James
2 Smith
3 Mark
4 Mike
5 Justing


In [34]:
indx = names.index("Mark")
indx

2
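`enumerate()` and `zip()` combine naturally, and `list.index()` has one caveat: it raises a `ValueError` when the value is absent. A minimal sketch follows; `find_index` is a hypothetical helper written for this example, not a built-in.

```python
names = ["James", "Smith", "Mark", "Mike", "Justing"]
score = [98, 100, 80, 79, 88]

# enumerate() over zip(): the inner tuple is unpacked in the loop header
for position, (name, points) in enumerate(zip(names, score), start=1):
    print(position, name, points)

def find_index(items, target):
    """Hypothetical helper: return the position of target, or -1 if absent."""
    try:
        return items.index(target)
    except ValueError:
        return -1

print(find_index(names, "Mark"))   # 2
print(find_index(names, "Susan"))  # -1
```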

# Pandas: Creating a DataFrame with `read_clipboard()`

In [35]:
import webbrowser
website = "https://en.wikipedia.org/wiki/List_of_all-time_NFL_win–loss_records"
webbrowser.open(website)

True

In [36]:
df_wikipedia = pd.read_clipboard()

In [37]:
type(df_wikipedia)

pandas.core.frame.DataFrame
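`read_clipboard()` only works after a table has been copied by hand (here, from the Wikipedia page), so the cell cannot be re-run reproducibly. Since `read_clipboard()` hands the clipboard text to `read_csv()`, a rough stand-in is to parse the same kind of text from a string. The sketch below assumes the copied text is tab-separated and uses two rows taken from the table in this notebook; `copied_text` and `df_sample` are illustrative names.

```python
import io
import pandas as pd

# Tab-separated text, similar to what a browser or spreadsheet puts on the clipboard
copied_text = (
    "Rank\tTeam\tGP\tWon\tLost\tTied\n"
    "1\tDallas Cowboys\t898\t512\t380\t6\n"
    "2\tChicago Bears\t1386\t761\t583\t42\n"
)

# Parse the text the same way read_csv would parse a file
df_sample = pd.read_csv(io.StringIO(copied_text), sep="\t")

print(df_sample.shape)           # (2, 6)
print(df_sample.loc[0, "Team"])  # Dallas Cowboys
```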

In [38]:
df_wikipedia.head()

| | Rank | Team | GP | Won | Lost | Tied | Pct. | First NFL Season | Division |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Dallas Cowboys | 898 | 512 | 380 | 6 | 0.573 | 1960 | NFC East |
| 1 | 2 | Chicago Bears | 1386 | 761 | 583 | 42 | 0.564 | 1920 | NFC North |
| 2 | 3 | Green Bay Packers | 1352 | 743 | 571 | 38 | 0.564 | 1921 | NFC North |
| 3 | 4 | New England Patriots[b] | 900 | 500 | 391 | 9 | 0.561 | 1960 | AFC East |
| 4 | 5 | Miami Dolphins | 816 | 452 | 360 | 4 | 0.556 | 1966 | AFC East |


In [40]:
df_wikipedia["Team"].head()

0             Dallas Cowboys
1              Chicago Bears
2          Green Bay Packers
3    New England Patriots[b]
4             Miami Dolphins
Name: Team, dtype: object

In [39]:
df_wikipedia.Team.head()

0             Dallas Cowboys
1              Chicago Bears
2          Green Bay Packers
3    New England Patriots[b]
4             Miami Dolphins
Name: Team, dtype: object

In [41]:
# DataFrame Index is a sequence (range) from 0 to 30, similar to using range(0,31,1)
df_wikipedia.index

RangeIndex(start=0, stop=31, step=1)

In [42]:
df_wikipedia.set_index('Team', inplace=True)
df_wikipedia.head()

| Team | Rank | GP | Won | Lost | Tied | Pct. | First NFL Season | Division |
|---|---|---|---|---|---|---|---|---|
| Dallas Cowboys | 1 | 898 | 512 | 380 | 6 | 0.573 | 1960 | NFC East |
| Chicago Bears | 2 | 1386 | 761 | 583 | 42 | 0.564 | 1920 | NFC North |
| Green Bay Packers | 3 | 1352 | 743 | 571 | 38 | 0.564 | 1921 | NFC North |
| New England Patriots[b] | 4 | 900 | 500 | 391 | 9 | 0.561 | 1960 | AFC East |
| Miami Dolphins | 5 | 816 | 452 | 360 | 4 | 0.556 | 1966 | AFC East |


In [43]:
# List DataFrame Index
df_wikipedia.index

Index(['Dallas Cowboys', 'Chicago Bears', 'Green Bay Packers',
       'New England Patriots[b]', 'Miami Dolphins', 'Minnesota Vikings',
       'Baltimore Ravens', 'New York Giants', 'Denver Broncos',
       'San Francisco 49ers', 'Indianapolis Colts[c]', 'Pittsburgh Steelers',
       'Kansas City Chiefs', 'Oakland Raiders', 'Seattle Seahawks',
       'Washington Redskins', 'Los Angeles Chargers[d]', 'Los Angeles Rams',
       'Carolina Panthers', 'Philadelphia Eagles', 'Cleveland Browns',
       'Tennessee Titans', 'Buffalo Bills', 'Detroit Lions',
       'Cincinnati Bengals', 'New Orleans Saints', 'New York Jets',
       'Houston Texans', 'Jacksonville Jaguars', 'Atlanta Falcons',
       'Arizona Cardinals'],
      dtype='object', name='Team')

In [44]:
df_wikipedia.loc["Dallas Cowboys"]

Rank                       1
GP                       898
Won                      512
Lost                     380
Tied                       6
Pct.                   0.573
First NFL Season        1960
Division            NFC East
Name: Dallas Cowboys, dtype: object

In [45]:
df_wikipedia.loc["Dallas Cowboys", "Won"]

512
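`.loc` selects by label, while `.iloc` selects by integer position. As a small sketch (the frame `df_sample` is built by hand from three rows of the table above so the example runs without the clipboard), both can reach the same cell:

```python
import pandas as pd

# Small hand-built frame with team names as the index
df_sample = pd.DataFrame(
    {"Won": [512, 761, 743], "Lost": [380, 583, 571]},
    index=["Dallas Cowboys", "Chicago Bears", "Green Bay Packers"],
)

# Label-based: row label, then column label
print(df_sample.loc["Chicago Bears", "Won"])  # 761

# Position-based: second row, first column
print(df_sample.iloc[1, 0])  # 761

# A list of labels selects several rows at once
print(df_sample.loc[["Dallas Cowboys", "Green Bay Packers"], "Lost"].tolist())  # [380, 571]
```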

In [48]:
df_wikipedia.tail()

| Team | Rank | GP | Won | Lost | Tied | Pct. | First NFL Season | Division |
|---|---|---|---|---|---|---|---|---|
| New York Jets | 27 | 900 | 401 | 491 | 8 | 0.45 | 1960 | AFC East |
| Houston Texans | 28 | 272 | 121 | 151 | 0 | 0.445 | 2002 | AFC South |
| Jacksonville Jaguars | 29 | 384 | 170 | 214 | 0 | 0.443 | 1995 | AFC South |
| Atlanta Falcons | 30 | 816 | 358 | 452 | 6 | 0.442 | 1966 | NFC South |
| Arizona Cardinals | 31 | 1346 | 553 | 753 | 40 | 0.426 | 1920 | NFC West |


**Another example using `read_clipboard()` with Excel data**

In [55]:
df_excel = pd.read_clipboard()
df_excel.head()

| | id | first_name | last_name | email | gender |
|---|---|---|---|---|---|
| 0 | 1 | David | Jordan | djordan0@home.pl | Male |
| 1 | 2 | Stephen | Riley | sriley1@hugedomains.com | Male |
| 2 | 3 | Evelyn | Grant | egrant2@livejournal.com | Female |
| 3 | 4 | Joe | Mendoza | jmendoza3@un.org | Male |
| 4 | 5 | Benjamin | Rodriguez | brodriguez4@elpais.com | Male |


In [50]:
df_excel.columns

Index(['id', 'first_name', 'last_name', 'email', 'gender'], dtype='object')

In [51]:
df_excel.index

RangeIndex(start=0, stop=32, step=1)

In [57]:
df_excel.first_name.head()

0       David
1     Stephen
2      Evelyn
3         Joe
4    Benjamin
Name: first_name, dtype: object

## Creating Series with an Index

In [61]:
gdp_per_capita = pd.Series([59939,8612,38214,44680,1980,39532,39827,9881,32038,44841,10846,29958], 
                           index=["United States", "China", "Japan", "Germany", "India", "United Kingdom", "France","Brazil", "Italy", "Canada", "Russia", "South Korea"])

In [62]:
gdp_per_capita

United States     59939
China              8612
Japan             38214
Germany           44680
India              1980
United Kingdom    39532
France            39827
Brazil             9881
Italy             32038
Canada            44841
Russia            10846
South Korea       29958
dtype: int64

In [63]:
gdp_per_capita.min()

1980

In [64]:
gdp_per_capita.idxmin()

'India'

In [65]:
gdp_per_capita["China"]

8612

In [66]:
# Positional access; plain [1] on a label index is deprecated in recent pandas
gdp_per_capita.iloc[1]

8612
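Because a Series carries its labels along, lookups, sorting and filtering all return labeled results. The sketch below uses a four-country subset of the figures above (the name `gdp` is illustrative) and spells out label access with `.loc` and positional access with `.iloc`.

```python
import pandas as pd

gdp = pd.Series(
    [59939, 8612, 38214, 1980],
    index=["United States", "China", "Japan", "India"],
)

# Explicit label-based and position-based access to the same element
print(gdp.loc["China"], gdp.iloc[1])  # 8612 8612

# idxmax() returns the label of the largest value
print(gdp.idxmax())  # United States

# Sorting keeps each label attached to its value
print(gdp.sort_values().index.tolist())  # ['India', 'China', 'Japan', 'United States']

# A boolean mask keeps only the matching labels
print(gdp[gdp > 30000].index.tolist())  # ['United States', 'Japan']
```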

In [67]:
gdp_per_capita.shape

(12,)

In [73]:
gdp_per_capita.index

Index(['United States', 'China', 'Japan', 'Germany', 'India', 'United Kingdom',
       'France', 'Brazil', 'Italy', 'Canada', 'Russia', 'South Korea'],
      dtype='object')

In [69]:
gdp_per_capita.values

array([59939,  8612, 38214, 44680,  1980, 39532, 39827,  9881, 32038,
       44841, 10846, 29958])

In [71]:
type(gdp_per_capita.index)

pandas.core.indexes.base.Index

In [72]:
type(gdp_per_capita.values)

numpy.ndarray