
# The Series Data Structure

In [5]:
import pandas as pd
import numpy as np

In [2]:
#Lookup Documentation
pd.Series?

In [3]:
animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)

0    Tiger
1     Bear
2    Moose
dtype: object

In [4]:
numbers = [1, 2, 3]
pd.Series(numbers)

0    1
1    2
2    3
dtype: int64

**How missing values are handled**

In [23]:
#Missing String value
animals = ['Tiger', 'Bear', None]
pd.Series(animals) #dtype is object; the missing entry stays None

0    Tiger
1     Bear
2     None
dtype: object

In [9]:
#Missing Numeric Value
numbers = [1, 2, None]
pd.Series(numbers) #dtype becomes float64; None is converted to NaN

0    1.0
1    2.0
2    NaN
dtype: float64

In [28]:
#NaN is NOT None
#NaN is a float value used to mark missing numeric data
np.nan == None

False

In [35]:
type(np.nan)

float

Check whether a value is NaN

In [34]:
#Use np.isnan() to test
np.isnan(np.nan)

True
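
As a small sketch of why a dedicated test is needed: NaN does not compare equal to anything, including itself, so `==` cannot detect it. The Series `mixed` below is made up for illustration; `pd.isna` is used because, unlike `np.isnan`, it also accepts `None` and object data.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself
nan_equals_itself = (np.nan == np.nan)

# Illustrative object Series holding both kinds of missing marker
mixed = pd.Series(['Tiger', None, np.nan], dtype=object)
missing_mask = pd.isna(mixed)   # element-wise test for missing values

print(nan_equals_itself)        # False
print(missing_mask.tolist())    # [False, True, True]
```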

Pass index when creating Series

In [43]:
#Pass dict
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

In [41]:
s.index

Index(['Archery', 'Golf', 'Sumo', 'Taekwondo'], dtype='object')

In [44]:
#Specify a list of index labels
s = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])
s

India      Tiger
America     Bear
Canada     Moose
dtype: object

In [46]:
#When both a dict and an index are given, the index decides which entries are kept
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
#Only the given labels are kept; a label missing from the dict gets NaN
s

Golf      Scotland
Sumo         Japan
Hockey         NaN
dtype: object

# Querying a Series

In [83]:
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

In [50]:
#the default positional index is hidden but always exists
s.iloc[3]

'South Korea'

In [63]:
#index label
s.loc['Golf']

'Scotland'

In [84]:
#An integer key on a non-integer index falls back to position
#(recent pandas versions deprecate this fallback; prefer s.iloc[3])
s[3]

'South Korea'

In [None]:
#Auto-recognition of label input
s['Golf']

In [85]:
#What if the index labels are integers as well?
sports = {99: 'Bhutan',
          100: 'Scotland',
          101: 'Japan',
          102: 'South Korea'}
s = pd.Series(sports)

try:
    s[0] #This does not fall back to s.iloc[0] as one might expect; it raises a KeyError instead
except KeyError:
    print("KeyError! With an integer index, s[0] looks up the label 0, "
          "not the first position.")

KeyError! With an integer index, s[0] looks up the label 0, not the first position.
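
A minimal sketch of the unambiguous alternative, reusing the integer-labelled Series from the cell above: `.iloc` always means position and `.loc` always means label, so neither depends on the index dtype.

```python
import pandas as pd

# Integer labels that do not start at 0 (same data as the cell above)
s = pd.Series({99: 'Bhutan', 100: 'Scotland', 101: 'Japan', 102: 'South Korea'})

by_position = s.iloc[0]   # first element, regardless of the labels
by_label = s.loc[99]      # element whose label is 99

print(by_position, by_label)   # Bhutan Bhutan
```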


**3 ways to sum a Series**

In [97]:
s = pd.Series(np.arange(10))
s

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [98]:
#Iterate through data - Traditional way
total = 0
for item in s:
    total+=item
print(total)

45


In [92]:
#Use numpy function to calculate
import numpy as np

total = np.sum(s) #Vectorization
print(total)

45


In [99]:
#Use pandas itself to calculate
s.sum() #Vectorization

45

In [100]:
#this creates a big series of random numbers
s = pd.Series(np.random.randint(0,1000,10000))
s.head() #display the first 5 elements

0    477
1    577
2    681
3    864
4    550
dtype: int64

In [101]:
len(s)

10000

Use `%%timeit -n x` to run the cell x times per timing run and report the average time per loop

In [104]:
%%timeit -n 100 #notebook magic run 100 times and time average
summary = 0
for item in s:
    summary+=item

1.55 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [105]:
%%timeit -n 100
summary = np.sum(s)

129 µs ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [106]:
%%timeit -n 100
summary = s.sum() #fastest

83.9 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
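
Outside a notebook the `%%timeit` magic is not available. The sketch below shows one possible way to make the same comparison with the standard `timeit` module. The helper `loop_sum` and the repeat count of 20 are illustrative choices, and absolute timings will differ by machine.

```python
import timeit
import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 1000, 10000))

def loop_sum(series):
    """Sum a Series one element at a time (the slow, traditional way)."""
    total = 0
    for item in series:
        total += item
    return total

# Both approaches must agree on the result
assert loop_sum(s) == s.sum()

# Total seconds for 20 calls of each approach
loop_time = timeit.timeit(lambda: loop_sum(s), number=20)
vector_time = timeit.timeit(lambda: s.sum(), number=20)
print(f"loop: {loop_time:.4f}s, vectorized: {vector_time:.4f}s")
```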


**With broadcasting, you can apply an operation to every value in the series**

In [107]:
s+=2 #adds two to each item in s using broadcasting
s.head()

0    479
1    579
2    683
3    866
4    552
dtype: int64

Avoid looping over a Series and assigning element by element with `.loc`; it is far slower than the broadcast version

In [111]:
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
for label, value in s.items(): #Series.iteritems() was removed in pandas 2.0
    s.loc[label]= value+2

624 ms ± 27.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [110]:
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
s+=2


466 µs ± 186 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [None]:
s = pd.Series([1, 2, 3])
s.loc['Animal'] = 'Bears' #.loc creates the label if it does not exist yet
s

**Pandas allows duplicate index labels**

In [None]:
original_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia',
                                      'Barbados',
                                      'Pakistan',
                                      'England'], 
                                   index=['Cricket',
                                          'Cricket',
                                          'Cricket',
                                          'Cricket'])


In [None]:
original_sports

In [None]:
cricket_loving_countries #multiple same index

In [None]:
#Series.append() was removed in pandas 2.0; pd.concat() returns a new Series with both inputs
all_countries = pd.concat([original_sports, cricket_loving_countries])
all_countries

In [None]:
all_countries.loc['Cricket'] #returns another Series
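
The effect of duplicate labels on a lookup can be checked on a small made-up Series. The data below is illustrative only: `index.is_unique` reports whether any label repeats, and `.loc` returns a Series for a repeated label but a scalar for a unique one.

```python
import pandas as pd

# Illustrative Series in which 'Cricket' appears twice
countries = pd.Series(['Australia', 'Barbados', 'Bhutan'],
                      index=['Cricket', 'Cricket', 'Archery'])

print(countries.index.is_unique)       # False

repeated = countries.loc['Cricket']    # Series holding both 'Cricket' rows
single = countries.loc['Archery']      # plain string, because the label is unique
```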

# The DataFrame Data Structure

In [None]:
import pandas as pd

In [None]:

purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
purchase_1

In [None]:
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], \
                  index=['Store 1', 'Store 1', 'Store 2']) #row index

df#.head()

*Note: use [ ] to select columns by label, <br>and .loc[ ] or .iloc[ ] to select rows.*

In [None]:
#.loc[] for row - label selection

df.loc['Store 2'] #Single row --> returns a Series (1D)

In [None]:
#.iloc[] for row - position selection
df.iloc[2] == df.loc['Store 2']

In [None]:
type(df.loc['Store 2'])

**Index can be Non-unique**:

In [None]:

df.loc['Store 1'] #multiple row -->returns a DataFrame

In [None]:
#Direct selection
df.loc['Store 1', 'Cost'] #one .loc call; also supports assignment

In [None]:
#Chained selection, use with caution!
df.loc['Store 1']['Cost'] #may return a copy of the data, so assignments can be lost
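
A brief sketch of the direct form used for assignment, on a small frame shaped like the purchase records above (values copied from that example). A single `.loc[row, column]` call writes into the DataFrame itself, which is why it is preferred over chained selection when changing data.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Chris', 'Kevyn', 'Vinod'],
                   'Cost': [22.5, 2.5, 5.0]},
                  index=['Store 1', 'Store 1', 'Store 2'])

# One .loc call with (row label, column label) modifies df in place
df.loc['Store 2', 'Cost'] = 6.0

print(df['Cost'].tolist())   # [22.5, 2.5, 6.0]
```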

In [None]:
df.T

In [None]:
df.T.loc['Cost']

In [None]:
#Indexing operator exclusively for column selection.
df['Cost'] 

In [None]:
#Slicing
df.loc[:,['Name', 'Cost']] 

In [None]:
#remove data
df.drop('Store 1') #default on row, inplace = False

In [None]:
df

In [None]:
copy_df = df.copy()
copy_df.drop('Cost', axis = 1, inplace = True)
copy_df

In [None]:
copy_df.drop?

In [None]:
del copy_df['Name'] #inplace delete
copy_df

In [None]:
#Change / add data
df['Location'] = None
df

*Note: use [ ] to select columns by label, <br>and .loc[ ] or .iloc[ ] to select rows.*

# Dataframe Indexing and Loading

**Pay Attention to Pointers**

In [None]:
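#Pointers
costs = df['Cost'] #costs can share data with df, so a change in costs may change df['Cost']
#(pandas versions with copy-on-write enabled keep the two separate)

#if we don't want to affect the original data, take an explicit copy instead:
#costs = df['Cost'].copy()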

costs

In [None]:
costs+=2 #df['Cost'] is changed at same time
costs

In [None]:
df
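
To make the copy case concrete, here is a minimal sketch on a made-up one-column frame: after an explicit `.copy()`, modifying `costs` leaves the original column untouched in every pandas version.

```python
import pandas as pd

df = pd.DataFrame({'Cost': [22.5, 2.5, 5.0]},
                  index=['Store 1', 'Store 1', 'Store 2'])

costs = df['Cost'].copy()   # independent copy of the column
costs += 2                  # modifies only the copy

print(costs.tolist())       # [24.5, 4.5, 7.0]
print(df['Cost'].tolist())  # [22.5, 2.5, 5.0]
```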

## Load Data 

`!` precedes a shell command

In [None]:
#Preview a file using !cat file

!cat olympics.csv #print the contents of olympics.csv

In [None]:
#option 1:
#open the file, read it into a list of dicts with csv.DictReader, pass the list to DataFrame:
import csv
with open('olympics.csv') as csvfile:
    record_dict = list(csv.DictReader(csvfile))
record = pd.DataFrame(record_dict) #each dict becomes one row
record

In [None]:
#better option 2:
#Use Pandas's .read_csv() to read csv into a DataFrame
df = pd.read_csv('olympics.csv')
df

In [None]:
#the datatype of column and row index
print(df.columns) #columns index is str, specified by first line of .csv file
print(df.index) #row index is auto-generated integer, based on number of lines

**Specify parameters to get a cleaner table**

In [None]:
#index_col: select which column as index (default None)
df = pd.read_csv('olympics.csv', index_col = 0)
df.head()

In [None]:
#skiprows=1 skips the first line, so the second line of the file becomes the column header
df = pd.read_csv('olympics.csv', index_col = 0, \
                skiprows = 1)
df.head()
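
The effect of `index_col` and `skiprows` can be tried without the course file. The CSV text below is invented purely for illustration (it is not the contents of olympics.csv) and is read from an in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical CSV with a junk first line, made up for this sketch
raw = io.StringIO(
    "junk,junk,junk\n"
    "Team,Gold,Silver\n"
    "A,1,2\n"
    "B,0,3\n"
)

# skiprows=1 drops the first line; index_col=0 uses 'Team' as the row index
demo = pd.read_csv(raw, index_col=0, skiprows=1)
print(demo)
```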

In [None]:
#Examine the column index and we find some problems:
df.columns

#Some names contain non-ASCII characters such as '№', and pandas adds the suffixes .1, .2 to duplicated names to make them unique

In [None]:
#df.rename(columns/index = {old_index_name: new_index_name})
old = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
print('old_df: \n', old)
new = old.rename(index=str, columns={"A": "a", "B": "c"})
print('new_df: \n', new)

In [None]:
#Manually refine the column index

for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold' + col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver' + col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze' + col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#' + col[1:]}, inplace=True) 

df.head()
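
As an alternative sketch, `rename` also accepts a function, so every column can be cleaned in a single call instead of one in-place rename per column. The column names below are hypothetical stand-ins shaped like the ones the loop above expects, and `clean` and `prefixes` are illustrative helper names.

```python
import pandas as pd

# Hypothetical column names, invented to match the slicing used in the loop above
demo = pd.DataFrame([[1, 2, 3, 4]],
                    columns=['№ Summer', '01 !', '02 !', '03 !.1'])

prefixes = {'01': 'Gold', '02': 'Silver', '03': 'Bronze'}

def clean(col):
    """Return the cleaned name for one column label."""
    if col[:2] in prefixes:
        return prefixes[col[:2]] + col[4:]
    if col[:1] == '№':
        return '#' + col[1:]
    return col

demo = demo.rename(columns=clean)   # the function is applied to each label
print(list(demo.columns))           # ['# Summer', 'Gold', 'Silver', 'Bronze.1']
```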

# Querying a DataFrame

**Understand Boolean Masking**

To build a Boolean mask for this query, we project the gold column using the indexing operator and apply the greater than operator with a comparison value of zero. This is essentially broadcasting a comparison operator, greater than, with the results being returned as a Boolean series.

In [None]:
#broadcasting and creating boolean mask
df['Gold'] > 0

`.where(mask)` takes a Boolean mask as a condition, applies it to the DataFrame or Series, and returns a new DataFrame or Series of the same shape, with NaN wherever the mask is False.

In [None]:
only_gold = df.where(df['Gold'] > 0) #same shape, with NaN in rows that fail the condition
only_gold.head()

In [None]:
#Most functions ignore NaN
only_gold['Gold'].count()

In [None]:
df['Gold'].count()

In [None]:
# Use .dropna() to drop NaN data
only_gold = only_gold.dropna() #axis = 0 by default
only_gold.head()

In [None]:
#Shortcut for where + dropna: index with the mask to filter rows directly
only_gold = df[df['Gold'] > 0] #df[[False, True, True..]]
only_gold.head()
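
The two routes can be compared on a small made-up medal table (values are illustrative). Both keep the same rows; one visible difference is that `.where()` introduces NaN, so integer columns come back as floats, while direct indexing keeps the original dtypes.

```python
import pandas as pd

demo = pd.DataFrame({'Gold': [0, 3, 1], 'Silver': [2, 0, 5]},
                    index=['A', 'B', 'C'])

mask = demo['Gold'] > 0

via_where = demo.where(mask).dropna()   # failing rows become NaN, then are dropped
via_index = demo[mask]                  # rows filtered directly

print(list(via_where.index), list(via_index.index))   # ['B', 'C'] ['B', 'C']
```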

In [None]:
#two Boolean masks being compared --> another Boolean mask
len(df[(df['Gold'] > 0) | (df['Gold.1'] > 0)]) #OR


In [None]:
#Underlying concept
pd.Series([True, False]) | pd.Series([False, True]) #wrap each mask in ( ) because | and & bind tighter than comparisons such as >

In [None]:
df[(df['Gold.1'] > 0) & (df['Gold'] == 0)] #AND
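
A compact sketch of combining masks, using a made-up frame with the same `Gold` and `Gold.1` column names as above. Each comparison is wrapped in parentheses before `|` or `&` is applied.

```python
import pandas as pd

demo = pd.DataFrame({'Gold': [0, 3, 0], 'Gold.1': [2, 0, 0]},
                    index=['A', 'B', 'C'])

# OR: a gold medal in either column
either = demo[(demo['Gold'] > 0) | (demo['Gold.1'] > 0)]

# AND: gold in 'Gold.1' but none in 'Gold'
second_only = demo[(demo['Gold.1'] > 0) & (demo['Gold'] == 0)]

print(list(either.index), list(second_only.index))   # ['A', 'B'] ['A']
```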

Example:<br>Write a query to return all of the names of people who bought products worth more than $3.00.


In [None]:

purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame([purchase_1, purchase_2, purchase_3], \
                  index=['Store 1', 'Store 1', 'Store 2'])

In [None]:
#Option 1
mask = df['Cost'] > 3.0
filtered = df.where(mask) #use a separate name so df stays intact for option 2
filtered = filtered.dropna()
filtered['Name']

In [None]:
#option 2
df[df['Cost'] > 3]['Name'] #mask, then select
#or,
df['Name'][df['Cost'] > 3] # select, then mask 

# Indexing Dataframes

In [None]:
df = pd.read_csv('olympics.csv', index_col = 0, skiprows = 1)

for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold' + col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver' + col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze' + col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#' + col[1:]}, inplace=True) 

df.head()

In [None]:
df['country'] = df.index #create new column 'country', and assign the value
country_index = df.index

In [None]:
df = df.set_index('Gold') #select 'Gold' as index
df.head() #old index is lost, unless saved before
#New index has the index name

In [None]:
print(df)

In [None]:
df = df.reset_index() #Move the index back to a column, leaving only the auto-generated integer index
df.head()

In [None]:
#to recover the index
df.set_index('country', inplace = True) #Use one of the columns
#or,
df.set_index(country_index, inplace = True) #Use an external Index or list
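
The round trip can be sketched on a tiny made-up frame: calling `reset_index()` before `set_index()` keeps the old index as a column, so it can be restored later. Names such as `by_gold` and `restored` are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({'Gold': [1, 0]}, index=['A', 'B'])
demo.index.name = 'country'

# Move 'country' into a column first, then index by 'Gold'
by_gold = demo.reset_index().set_index('Gold')

# Reverse the operation to get the original layout back
restored = by_gold.reset_index().set_index('country')

print(list(by_gold.columns), list(restored.index))   # ['country'] ['A', 'B']
```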


**Multi-level Indexing**

In [None]:
df = pd.read_csv('census.csv')
df.head()

In [None]:
#Examine the scope of data
df['SUMLEV'].unique() #or set(df['SUMLEV'])


In [None]:
df=df[df['SUMLEV'] == 50] #filter shortcut
df.head()

In [None]:
#Keep only the columns of interest
columns_to_keep = ['STNAME',
                   'CTYNAME',
                   'BIRTHS2010',
                   'BIRTHS2011',
                   'BIRTHS2012',
                   'BIRTHS2013',
                   'BIRTHS2014',
                   'BIRTHS2015',
                   'POPESTIMATE2010',
                   'POPESTIMATE2011',
                   'POPESTIMATE2012',
                   'POPESTIMATE2013',
                   'POPESTIMATE2014',
                   'POPESTIMATE2015']
df = df[columns_to_keep]
df.head()

In [None]:
#Multi-level indexing (similar in spirit to GROUP BY)
df = df.set_index(['STNAME', 'CTYNAME']) #index by state, then by county
df.head()

In [None]:
#Query multi-level index
df.loc['Michigan', 'Washtenaw County']#[level 0, level 1, ..]
#or, 
df.loc[('Michigan', 'Washtenaw County')]

In [None]:
#Query multiple multi-indexes
df.loc[[('Michigan', 'Washtenaw County'),
         ('Michigan', 'Wayne County')]] #double []

In [None]:
#Inspect multi-level index elements
print(df.index) #a MultiIndex: n entries, each a (STNAME, CTYNAME) tuple
df.index.get_level_values(1) #2nd-level label of every row; df.index[1] would be the tuple for the 2nd row only
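
A small sketch of working with index levels, on a made-up frame that borrows the `STNAME`/`CTYNAME` level names (the birth counts and the Ohio row are invented for illustration). `get_level_values` pulls one level for every row, and a partial `.loc` selects everything under one first-level label.

```python
import pandas as pd

demo = pd.DataFrame(
    {'BIRTHS2010': [10, 20, 30]},   # invented numbers
    index=pd.MultiIndex.from_tuples(
        [('Michigan', 'Washtenaw County'),
         ('Michigan', 'Wayne County'),
         ('Ohio', 'Adams County')],
        names=['STNAME', 'CTYNAME']))

counties = demo.index.get_level_values('CTYNAME')   # second level, every row
michigan = demo.loc['Michigan']                      # all rows under one state

print(list(counties))
print(list(michigan.index))
```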

Example: <br>Reindex the purchase records DataFrame to be indexed hierarchically, first by store, then by person. Name these indexes 'Location' and 'Name'. Then add a new entry to it with the value of:

Name: 'Kevyn', Item Purchased: 'Kitty Food', Cost: 3.00 Location: 'Store 2'.

In [None]:
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

eg = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
eg

In [None]:
#Solution:
eg = eg.set_index([eg.index, 'Name']) #Set index
eg

In [None]:
#Name the index levels
eg.index.names = ['Location', 'Name'] #in place; a list or a tuple both work
#or
eg = eg.rename_axis(['Location', 'Name']) #returns a new DataFrame
#eg.index.names = ('Location', 'Name') #works
#eg.index.names = 'Location', 'Name' #works
print(eg.index)
eg.index[1][0] #[row position][index level]


In [None]:
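#Add a new row to the existing DataFrame
#DataFrame.append() was removed in pandas 2.0, so build a one-row DataFrame and use pd.concat()
new_row = pd.DataFrame([{'Item Purchased': 'Kitty Food', 'Cost': 3.00}],
                       index=pd.MultiIndex.from_tuples([('Store 2', 'Kevyn')],
                                                       names=['Location', 'Name']))
eg = pd.concat([eg, new_row]) #index tuple = (level 0, level 1)
eg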

# Missing values

In [None]:
df = pd.read_csv('log.csv')
df

In [None]:
df.fillna?

In [None]:
#Sort by value
df.sort_values(by = 'time', axis = 0, ascending = True) #sort by 'time'


In [None]:
#sort by index
df = df.set_index('time')#set 'time' as index
df = df.sort_index() #sort by 'time'
df

In [None]:
df.set_index('user')

In [None]:
#Reset the index before setting a new one; otherwise the old index is lost
df = df.reset_index() #Put old index back as a column
df = df.set_index(['time', 'user']) #Set whatever index
df

In [None]:
#Forward-fill: replace each NaN/None with the last valid value above it
df = df.ffill() #replaces df.fillna(method='ffill'), which recent pandas versions deprecate
df.head()

#.fillna() can also take a scalar, dict, or Series of fill values
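
The difference between forward-filling and filling with a constant is easy to see on a short made-up column. The `volume` name and its values are illustrative and not taken from log.csv.

```python
import numpy as np
import pandas as pd

# Illustrative data with two consecutive gaps
log = pd.DataFrame({'volume': [10.0, np.nan, np.nan, 5.0]},
                   index=[1, 2, 3, 4])

filled = log.ffill()        # carry the last valid value forward
constant = log.fillna(0)    # or replace every NaN with a single value

print(filled['volume'].tolist())     # [10.0, 10.0, 10.0, 5.0]
print(constant['volume'].tolist())   # [10.0, 0.0, 0.0, 5.0]
```

Forward-filling only makes sense when the rows are in a meaningful order, which is why the cells above sort by time first.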