
Welcome to Pandas for Data Science

Today's agenda:

    Introducing Pandas
    Comments, Variables
    Loading Data into Pandas
    Reading and working with Data in Pandas
    Writing/Exporting Data into Desired Format

    
Section 1: Introducing Pandas:
What is Pandas? (And why should you learn it?)
How can I:
select individual values from a Pandas DataFrame?
select entire rows or entire columns from a DataFrame?
select a subset of both rows and columns from a DataFrame in a single operation?

    Pandas is a powerful data manipulation and wrangling library for Python.

    A DataFrame is a collection of Series: the DataFrame is the way Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.

    What makes Pandas so attractive is its powerful interface for accessing individual records of a table, its handling of missing values, and its support for relational-database-style operations between DataFrames.
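
To make the DataFrame/Series relationship concrete, here is a minimal sketch. The names mini, gdp_1952 and gdp_1957 are made up for illustration, and the three rows are typed in by hand from the Europe GDP table used later in this workshop, so nothing needs to be downloaded.

```python
import pandas as pd

# Hypothetical mini-table: three rows copied by hand from the Europe GDP data.
countries = ["Albania", "Austria", "Belgium"]
gdp_1952 = pd.Series([1601.056136, 6137.076492, 8343.105127],
                     index=countries, name="gdpPercap_1952")
gdp_1957 = pd.Series([1942.284244, 8842.598030, 9714.960623],
                     index=countries, name="gdpPercap_1957")

# Place the two Series side by side: each one becomes a column of the table.
mini = pd.concat([gdp_1952, gdp_1957], axis=1)

# Pulling a column back out of the DataFrame gives a Series again.
column = mini["gdpPercap_1952"]
print(type(mini).__name__, type(column).__name__)
```

Both Series share the same row labels, which is what lets Pandas line them up as columns of one table.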

Section 2: COMMENTS:

    Comments are lines (or parts of lines) of a Python script that Python ignores when it runs the code
    Comments are useful for writing notes to your future self about what you were thinking
    Simple one-line comments start with a #
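
A short sketch of the comment rules above. The variable name gdp_1952 is only an illustration; its value is Albania's 1952 figure from the workshop data.

```python
# This whole line is a comment, so Python skips it.
gdp_1952 = 1601.056136  # a comment can also follow code on the same line

# A commented-out statement is skipped as well:
# gdp_1952 = 0

print(gdp_1952)
```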

Section 3: SELECTING VALUES:

    

import pandas as pd

print('pandas version', pd.__version__)

data = pd.read_csv('https://raw.githubusercontent.com/laitanawe/pandasds/main/workshops/data/gapminder_gdp_europe.csv')
store = pd.read_csv('https://raw.githubusercontent.com/laitanawe/pandasds/main/workshops/data/cal_housing_small.csv')

print(data)

data.head(10) # First 10 lines of our data frame

# Ctrl + Shift + Enter runs only the selected line(s)
# Ctrl + Enter runs all lines in the current cell

# To access the value at position [i, j] of a DataFrame, we have two options,
# depending on what i means: a row label (.loc) or an integer row position (.iloc).
# A DataFrame provides an index as a way to identify the rows of the table; a row, then,
# has a position inside the table as well as a label, which identifies its entry in the DataFrame.

# Plain square brackets look up column labels, so this would raise a KeyError:
# data[4, 3]

data.iloc[4,3]

data.head(10)

data.loc[7,"gdpPercap_1972"]

print(data.loc[0, "gdpPercap_1952"])
print(data.iloc[0,1])
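
The two prints above give the same value only because the default row labels (0, 1, 2, ...) happen to match the row positions. The sketch below shows where labels and positions part ways. It uses a small hand-typed frame (mini, a made-up name) instead of the downloaded file.

```python
import pandas as pd

# Hypothetical mini-table with the default integer index.
mini = pd.DataFrame({
    "country": ["Albania", "Austria", "Belgium"],
    "gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127],
})

# With the default index, row label 0 is also row position 0.
by_label = mini.loc[0, "gdpPercap_1952"]   # row label, column label
by_position = mini.iloc[0, 1]              # row position, column position

# After dropping the first row, label 1 sits at position 0.
rest = mini.drop(index=0)
first_remaining = rest.iloc[0, 0]   # selected by position
same_row = rest.loc[1, "country"]   # selected by label

print(by_label, by_position)
print(first_remaining, same_row)
```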

# DataFrame.loc selects by row and column label, much like a 2D version of dictionary keys.
# From now on, use the country name as the index.
data = pd.read_csv('https://raw.githubusercontent.com/laitanawe/pandasds/main/workshops/data/gapminder_gdp_europe.csv', index_col='country')
# See an example where the country name is the index
print(data.loc["Albania", "gdpPercap_1952"])

print("")

# Use DataFrame.iloc[..., ...] to select values by their (entry) position.
# DataFrame.iloc selects by integer position, much like a 2D version of indexing characters in a string.
mydata = pd.read_csv('https://raw.githubusercontent.com/laitanawe/pandasds/main/workshops/data/gapminder_gdp_europe.csv', index_col='country')
print(mydata.iloc[0, 0])

print("")
data.head(10)
# Use : on its own to mean all columns or all rows,
# just like Python’s usual slicing notation.
print(data.loc["Albania", :]) # All columns (the GDP for every year) of the row "Albania"

print("")
# The row selector comes first in .loc, so you get the same result
# with the row label alone (without a second selector):
print(data.loc["Albania"])

# Print out the gdpPercap for all years in France
# Hint: Use the .loc method

print(data.loc[:, "gdpPercap_1952"]) # This means all rows of the column "gdpPercap_1952"

# For a single column, plain square brackets return the same result:
data["gdpPercap_1952"]
data["gdpPercap_1957"]
# Attribute access also works when the column name is a valid Python identifier:
data.gdpPercap_1952
data.gdpPercap_1977
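
As a quick check that these spellings really select the same data, here is a sketch on a hand-typed two-row frame (mini is a hypothetical name; the values come from the workshop table).

```python
import pandas as pd

# Hypothetical mini-table indexed by country.
mini = pd.DataFrame(
    {"gdpPercap_1952": [1601.056136, 6137.076492],
     "gdpPercap_1957": [1942.284244, 8842.598030]},
    index=pd.Index(["Albania", "Austria"], name="country"),
)

# Two ways to select one whole row.
row_a = mini.loc["Albania", :]
row_b = mini.loc["Albania"]

# Three ways to select one whole column.
col_a = mini.loc[:, "gdpPercap_1952"]
col_b = mini["gdpPercap_1952"]
col_c = mini.gdpPercap_1952

print(row_a.equals(row_b), col_a.equals(col_b), col_b.equals(col_c))
```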

# A single value by position (row 9, column 4):
print(data.iloc[9,4])

# Select multiple columns or rows using DataFrame.loc and a named (label-based) slice:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])


# Using a named slice, how do we access the rows from Greece to Norway, and columns from gdpPercap_1977 to gdpPercap_2007?
data.loc["Greece":"Norway", "gdpPercap_1977":"gdpPercap_2007"]

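# Using a numerical slice, how do we access the same rows and columns?
# iloc excludes the end position, so look up each position and add 1 to the end.
rows = slice(data.index.get_loc("Greece"), data.index.get_loc("Norway") + 1)
cols = slice(data.columns.get_loc("gdpPercap_1977"), data.columns.get_loc("gdpPercap_2007") + 1)
# The columns are positions 5:12 (6:12 would start at gdpPercap_1982)
data.iloc[rows, cols]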

# To get the integer position of a column, use:
data.columns.get_loc("gdpPercap_2007")

data.head(30)

# Numerical slices in Pandas
print(data.iloc[3:9,0:4])

# IMPORTANT DIFFERENCE BETWEEN .loc and .iloc
# In the above code, we discover that slicing using loc is inclusive at both ends,
# which differs from slicing using iloc, where slicing indicates everything up to
# but not including the final index.
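
The inclusive/exclusive difference is easy to see on a small frame. This is a sketch on five hand-typed rows from the workshop table (mini is a made-up name), not the full dataset.

```python
import pandas as pd

# Hypothetical mini-table: the first five countries of the Europe GDP data.
mini = pd.DataFrame(
    {
        "gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127, 973.533195, 2444.286648],
        "gdpPercap_1957": [1942.284244, 8842.598030, 9714.960623, 1353.989176, 3008.670727],
    },
    index=["Albania", "Austria", "Belgium", "Bosnia and Herzegovina", "Bulgaria"],
)

# Label slice: both end labels are included.
by_label = mini.loc["Austria":"Bosnia and Herzegovina"]

# Position slice: stops before position 3.
by_position = mini.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))

# To reproduce the label slice by position, extend the stop by one.
start = mini.index.get_loc("Austria")
stop = mini.index.get_loc("Bosnia and Herzegovina") + 1
matched = mini.iloc[start:stop]
```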

# The result of slicing can be used in further operations;
# usually we don’t just print a slice.
# All the statistical operators that work on entire dataframes work the same way on slices.
# E.g., calculate the max of each column in a slice.
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())

print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())

# Use comparisons to select data based on value.
# Comparison is applied element by element.
# Returns a similarly-shaped dataframe of True and False.
# Use a subset of data to keep output readable.
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subset)

# Which values were greater than 10000 ?
print('\nWhere are values greater than 10000?\n', subset > 10000)

# Select values or NaN using a Boolean mask.
# A frame full of Booleans is sometimes called a mask because of how it can be used.
mask = subset > 10000
print(mask)
print(subset[mask])

# Indexing with the mask keeps the value where the mask is True, and gives NaN (Not a Number) where it is False.
# .describe is useful here because NaNs are ignored by operations like max, min, average, etc.
print(subset[subset > 10000].describe())
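
The effect of NaN on summary statistics can be checked on a small hand-typed subset. This is only a sketch: the frame below reuses the name subset for illustration, with three rows and two columns copied from the workshop table.

```python
import pandas as pd

# Hypothetical subset: three countries, two years.
subset = pd.DataFrame(
    {"gdpPercap_1962": [2312.888958, 10750.72111, 10991.20676],
     "gdpPercap_1967": [2760.196931, 12834.6024, 13149.04119]},
    index=["Albania", "Austria", "Belgium"],
)

mask = subset > 10000    # True/False frame with the same shape as subset
masked = subset[mask]    # values where True, NaN where False

# NaN entries are skipped, so these statistics use only the values above 10000.
print(masked.min())
print(masked.count())
```

Albania's values are all below 10000, so its whole row becomes NaN and does not contribute to min or count.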

# Group By: split-apply-combine
# Pandas vectorizing methods and grouping operations are features that provide users
# much flexibility to analyse their data.

# For instance, let’s say we want a clearer view of how the European countries
# split themselves according to their GDP.

# We can get a first impression by splitting the countries into two groups for each year surveyed:
# those with a GDP higher than the European average and those with a lower GDP.
# We then estimate a wealth score based on the historical (from 1952 to 2007) values,
# counting the fraction of years in which a country was in the higher-GDP group.
mask_higher = data > data.mean()
print(mask_higher)
# Hint: use the mask as the index for your data frame
print(data[mask_higher])
# Number of years above average, divided by the number of years surveyed
wealth_score = mask_higher.aggregate('sum', axis=1) / len(data.columns)
print(len(data.columns))
wealth_score

# Finally, for each group of countries sharing the same wealth_score, we sum their (financial)
# contribution across the years surveyed:
data.groupby(wealth_score).sum()
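
The same split-apply-combine steps can be followed by hand on a tiny frame. The sketch below uses four hand-typed countries and two years (mini and above_mean are made-up names), so the averages differ from those of the full dataset.

```python
import pandas as pd

# Hypothetical mini-table: four countries, two years.
mini = pd.DataFrame(
    {"gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127, 2444.286648],
     "gdpPercap_1957": [1942.284244, 8842.598030, 9714.960623, 3008.670727]},
    index=["Albania", "Austria", "Belgium", "Bulgaria"],
)

# Split: compare each value with its column (year) average.
above_mean = mini > mini.mean()

# Score: fraction of years in which each country was above average.
wealth_score = above_mean.sum(axis=1) / len(mini.columns)

# Apply and combine: sum the GDP values of the countries sharing a score.
totals = mini.groupby(wealth_score).sum()

print(wealth_score)
print(totals)
```

In this small sample Austria and Belgium are above average in both years, so the grouped table has two rows: score 0.0 (Albania and Bulgaria) and score 1.0 (Austria and Belgium).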

# Selection of Individual Values
# Assume Pandas has been imported into your notebook and the Gapminder GDP data for Europe has been loaded:
df = pd.read_csv('https://raw.githubusercontent.com/laitanawe/pandasds/main/workshops/data/gapminder_gdp_europe.csv', index_col='country')

# Write an expression to find the Per Capita GDP of Serbia in 2007.

# .loc selects by label, i.e. by row name and column name:
# df.loc[row_labels, column_labels]

# Return all rows and all columns
df.loc[:]

# The selection can be done by using the labels for both the row (“Serbia”)
# and the column (“gdpPercap_2007”):
print(df.loc['Serbia', 'gdpPercap_2007'])
df.loc["Belgium","gdpPercap_1962"]

# All rows and certain columns
df.loc[:, "gdpPercap_1962"]

# Certain rows and all columns for one country
df.loc["Belgium",:]
# Multiple countries, all columns
df.loc[["Albania","Belgium"],:]
df.loc[["Serbia","France","Austria"],:]

# How would you return "gdpPercap_1962", "gdpPercap_1972" and "gdpPercap_1982" for Serbia, France and Austria?
# Hint: use .loc with a list of rows and a list of columns as the two selectors
df.loc[["Serbia","France","Austria"],["gdpPercap_1962","gdpPercap_1972","gdpPercap_1982"]]

# Another approach: take the row labels from a label slice and pick the columns with a regular expression
data.loc[data.loc["Albania":"Bulgaria"].index,]
getind = data.loc["Albania":"Bulgaria"].index
data.loc[getind, list(data.filter(regex="1962|1972|1982").columns)]
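
The row-slice plus regular-expression pattern above can be tried on a small hand-typed frame. This is a sketch: mini, rows, cols and picked are illustrative names, and the values are copied from the workshop table.

```python
import pandas as pd

# Hypothetical mini-table: three countries, three years.
mini = pd.DataFrame(
    {"gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127],
     "gdpPercap_1962": [2312.888958, 10750.72111, 10991.20676],
     "gdpPercap_1972": [3313.422188, 16661.6256, 16672.14356]},
    index=["Albania", "Austria", "Belgium"],
)

# Row labels taken from a label slice (both ends included).
rows = mini.loc["Albania":"Austria"].index

# Column labels whose names match the regular expression.
cols = list(mini.filter(regex="1962|1972").columns)

picked = mini.loc[rows, cols]
print(picked)
```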


# Slice notation (start:stop) is not valid inside a list, so this raises a SyntaxError:
# df.loc[["Serbia":"Austria"],["gdpPercap_1962":"gdpPercap_1982"]]
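
A list and a slice can still be combined as long as each one is a separate selector. The sketch below shows this on a small hand-typed frame (mini and picked are made-up names).

```python
import pandas as pd

# Hypothetical mini-table: three countries, three years.
mini = pd.DataFrame(
    {"gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127],
     "gdpPercap_1957": [1942.284244, 8842.598030, 9714.960623],
     "gdpPercap_1962": [2312.888958, 10750.72111, 10991.20676]},
    index=["Albania", "Austria", "Belgium"],
)

# Rows given as a list (in the order requested), columns given as a label slice.
picked = mini.loc[["Belgium", "Albania"], "gdpPercap_1952":"gdpPercap_1962"]
print(picked)
```

The slice is written without square brackets of its own; the list keeps its brackets.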

# One country, multiple columns
df.loc["Belgium",["gdpPercap_1962", "gdpPercap_2007"]]

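# Rows Albania, Belgium, Bulgaria; all columns
df.loc[["Albania","Belgium","Bulgaria"]]
# A label slice returns every row from Albania through Bulgaria (both ends included)
print(df.loc['Albania':'Bulgaria'])
# To slice columns by label, put the column slice in the second position
print(df.loc[:, "gdpPercap_1962":"gdpPercap_2007"])
# Same rows as above; the explicit : selects all columns
df.loc["Albania":"Bulgaria", :]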

# Reload the data with the country name as the index
data = pd.read_csv('https://raw.githubusercontent.com/laitanawe/pandasds/main/workshops/data/gapminder_gdp_europe.csv', index_col='country')

data.head(15)
data.columns

data.iloc[7,5]
print(data.iloc[5,7])
print(data.loc["Albania","gdpPercap_1977"])

# Python counts from zero; the general form is
# data.iloc[row_position, column_position]

# Look up the integer position of a column label or a row label
data.columns.get_loc("gdpPercap_2007")
data.index.get_loc("Albania")

import pandas as pd

print('pandas version', pd.__version__)

# Use this data, GDP for countries in Europe
data = pd.read_csv('https://raw.githubusercontent.com/laitanawe/pandasds/main/workshops/data/gapminder_gdp_europe.csv')

# Colab
store = pd.read_csv('https://raw.githubusercontent.com/laitanawe/pandasds/main/workshops/data/cal_housing_small.csv')


display(data)  # rich table output; display() is available inside notebooks
print(data)    # plain-text output



pandas version 2.1.4
                   country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
0                  Albania     1601.056136     1942.284244     2312.888958   
1                  Austria     6137.076492     8842.598030    10750.721110   
2                  Belgium     8343.105127     9714.960623    10991.206760   
3   Bosnia and Herzegovina      973.533195     1353.989176     1709.683679   
4                 Bulgaria     2444.286648     3008.670727     4254.337839   
5                  Croatia     3119.236520     4338.231617     5477.890018   
6           Czech Republic     6876.140250     8256.343918    10136.867130   
7                  Denmark     9692.385245    11099.659350    13583.313510   
8                  Finland     6424.519071     7545.415386     9371.842561   
9                   France     7029.809327     8662.834898    10560.485530   
10                 Germany     7144.114393    10187.826650    12902.462910   
11                  Greece     3530.690067 

country,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
Albania,1601.056136,1942.284244,2312.888958,2760.196931,3313.422188,3533.00391,3630.880722,3738.932735,2497.437901,3193.054604,4604.211737,5937.029526
Austria,6137.076492,8842.59803,10750.72111,12834.6024,16661.6256,19749.4223,21597.08362,23687.82607,27042.01868,29095.92066,32417.60769,36126.4927
Belgium,8343.105127,9714.960623,10991.20676,13149.04119,16672.14356,19117.97448,20979.84589,22525.56308,25575.57069,27561.19663,30485.88375,33692.60508
Bosnia and Herzegovina,973.533195,1353.989176,1709.683679,2172.352423,2860.16975,3528.481305,4126.613157,4314.114757,2546.781445,4766.355904,6018.975239,7446.298803
Bulgaria,2444.286648,3008.670727,4254.337839,5577.0028,6597.494398,7612.240438,8224.191647,8239.854824,6302.623438,5970.38876,7696.777725,10680.79282
