
Welcome to Pandas for Data Science
Today's agenda:

    Introducing Pandas
    Comments, Variables
    Loading Data into Pandas
    Reading and working with Data in Pandas
    Writing/Exporting Data into Desired Format

    
Section 1: Introducing Pandas:
What is Pandas? (And why should you learn it?)
How can I:
Select individual values from a Pandas dataframe.
Select entire rows or entire columns from a dataframe.
Select a subset of both rows and columns from a dataframe in a single operation.

    Pandas is a powerful data manipulation and wrangling library for Python.

    A DataFrame is a collection of Series: the DataFrame is the way Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.

    What makes Pandas so attractive is its powerful interface for accessing individual records of the table, its proper handling of missing values, and its support for relational-database operations between DataFrames.
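
    As a minimal sketch of how the two structures relate, the snippet below builds a small DataFrame by hand. The row labels and numbers are invented for illustration and are not part of the workshop data. It shows that selecting one column gives a Series and that a missing value is skipped by default in a summary such as the mean.

```python
import pandas as pd

# Hypothetical toy table; the labels and values are made up for illustration.
toy = pd.DataFrame(
    {"gdp_1952": [1600.0, 6100.0, None], "gdp_2007": [5900.0, 36100.0, 33700.0]},
    index=["A", "B", "C"],
)

column = toy["gdp_1952"]        # selecting one column returns a Series
print(type(column).__name__)    # Series
print(column.isna().sum())      # number of missing entries in the column
print(column.mean())            # the missing entry is skipped by default
```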

Section 2: COMMENTS:

    In Pandas code, as in any Python code, comments are lines of a script that you want Python to ignore
    Comments are useful for writing notes to your future self about what you were thinking
    Simple one-line comments start with a #
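
    A short sketch of the comment styles described above. The variable name gdp_1952 is only an illustration.

```python
# This whole line is a comment and is ignored by Python.
gdp_1952 = 1601.056136  # a comment can also follow code on the same line

# Commenting out a line disables it without deleting it:
# gdp_1952 = 0
print(gdp_1952)
```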

Section 3: SELECTING VALUES:

    

In [None]:
import pandas as pd

print('pandas version', pd.__version__)

data = pd.read_csv('https://raw.githubusercontent.com/laitanawe/pandasds/main/workshops/data/gapminder_gdp_europe.csv')
store = pd.read_csv('https://raw.githubusercontent.com/laitanawe/pandasds/main/workshops/data/cal_housing_small.csv')

# Ctrl + Shift + Enter = Specific Line Run
# Ctrl + Enter = All Lines Run

# To access the value at position [i, j] of a DataFrame, we have two options, depending on
# what i and j mean: labels or integer positions.
# A DataFrame provides an index as a way to identify the rows of the table; a row, then,
# has a position inside the table as well as a label, which uniquely identifies its entry in the DataFrame.

# DataFrame.loc selects by row and column label, analogous to a 2D version of dictionary keys.
data = pd.read_csv('https://raw.githubusercontent.com/laitanawe/pandasds/main/workshops/data/gapminder_gdp_europe.csv', index_col='country')
print(data.loc["Albania", "gdpPercap_1952"])

print("")
# Use DataFrame.iloc[..., ...] to select values by their (entry) position
# DataFrame.iloc selects by integer position, analogous to a 2D version of character selection in strings.
# data was already loaded above with index_col='country', so there is no need to read the file again.
print(data.iloc[0, 0])

print("")
# Use : on its own to mean all columns or all rows.
# Just like Python’s usual slicing notation.
print(data.loc["Albania", :]) # This means all columns of the row "Albania"

print("")
# You would get the same result printing:
print(data.loc["Albania"]) # (without a second index or second item in the list).

print(data.loc[:, "gdpPercap_1952"]) # This means all rows of the column "gdpPercap_1952"

# You would also get the same result printing:
data["gdpPercap_1952"]
# Also get the same result printing
data.gdpPercap_1952 # (since it’s a column name)

# Select multiple columns or rows using DataFrame.loc and a named slice.
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])

# IMPORTANT DIFFERENCE BETWEEN .loc and .iloc
# In the above code, we discover that slicing using loc is inclusive at both ends,
# which differs from slicing using iloc, where slicing indicates everything up to
# but not including the final index.
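
# A minimal sketch of the inclusive/exclusive rule on a small hand-made frame.
# The frame and the name toy_slice are illustrative assumptions, not workshop data.
import pandas as pd  # repeated so this snippet also runs on its own
toy_slice = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["r0", "r1", "r2"],
    columns=["c0", "c1", "c2"],
)
by_label = toy_slice.loc["r0":"r1", "c0":"c1"]  # label slice: includes "r1" and "c1"
by_position = toy_slice.iloc[0:1, 0:1]          # position slice: stops before position 1
print(by_label.shape)     # (2, 2)
print(by_position.shape)  # (1, 1)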

# Result of slicing can be used in further operations.
# Usually you don’t just print a slice.
# All the statistical operators that work on entire dataframes work the same way on slices.
# E.g., calculate max of a slice.
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())

print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())

# Use comparisons to select data based on value.
# Comparison is applied element by element.
# Returns a similarly-shaped dataframe of True and False.
# Use a subset of data to keep output readable.
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subset)

# Which values were greater than 10000 ?
print('\nWhere are values large?\n', subset > 10000)

# Select values or NaN using a Boolean mask.
# A frame full of Booleans is sometimes called a mask because of how it can be used.
mask = subset > 10000
print(subset[mask])

# Get the value where the mask is true, and NaN (Not a Number) where it is false.
# Useful because NaNs are ignored by operations like max, min, average, etc.
print(subset[subset > 10000].describe())
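
# A small sketch of masking on a hand-made frame; toy_mask and its values are
# made up for illustration and are not part of the workshop data.
import pandas as pd  # repeated so this snippet also runs on its own
toy_mask = pd.DataFrame({"x": [5, 15], "y": [25, 8]}, index=["a", "b"])
masked = toy_mask[toy_mask > 10]  # values that fail the comparison become NaN
print(masked)
print(masked.max())               # NaN entries are skipped: x -> 15.0, y -> 25.0
print(masked.count())             # count() reports only the non-NaN entries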

# Group By: split-apply-combine
# Pandas' vectorized methods and grouping operations give users
# a lot of flexibility when analysing their data.

# For instance, let’s say we want a clearer view of how the European countries
# split themselves according to their GDP.

# We can get a first look by splitting the countries into two groups for each year surveyed:
# those with a GDP higher than the European average and those with a lower GDP.
# We then estimate a wealth score based on the historical (from 1952 to 2007) values,
# where we count how many times
# a country has appeared in the higher-GDP group, as a fraction of the years surveyed.
mask_higher = data > data.mean()
wealth_score = mask_higher.aggregate('sum', axis=1) / len(data.columns)
wealth_score

# Finally, for each group in the wealth_score table, we sum their (financial)
# contribution across the years surveyed:
data.groupby(wealth_score).sum()
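
# A step-by-step sketch of the same split-apply-combine idea on a toy frame.
# The three row labels and their values are invented for illustration only.
import pandas as pd  # repeated so this snippet also runs on its own
toy_gdp = pd.DataFrame(
    {"y1": [1.0, 5.0, 9.0], "y2": [2.0, 4.0, 12.0]},
    index=["low", "mid", "high"],
)
toy_above = toy_gdp > toy_gdp.mean()                      # True where a value exceeds its column mean
toy_score = toy_above.sum(axis=1) / len(toy_gdp.columns)  # fraction of columns above the mean
toy_grouped = toy_gdp.groupby(toy_score).sum()            # one row per distinct score
print(toy_score)
print(toy_grouped)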

# Selection of Individual Values
# Assume Pandas has been imported into your notebook and the Gapminder GDP data for Europe has been loaded:
df = pd.read_csv('https://raw.githubusercontent.com/laitanawe/pandasds/main/workshops/data/gapminder_gdp_europe.csv', index_col='country')

# Write an expression to find the Per Capita GDP of Serbia in 2007.

# .loc lets you filter your dataframe by location name i.e. rowname / column name
# data.loc[#Rows, #Columns]
# data.loc[]

# The selection can be done by using the labels for both the row (“Serbia”)
# and the column (“gdpPercap_2007”):
# All rows and all columns
df.loc[:]

# Certain rows and certain columns
print(df.loc['Serbia', 'gdpPercap_2007'])
df.loc["Belgium","gdpPercap_1962"]

# All rows and certain columns
df.loc[:, "gdpPercap_1962"]

# Certain rows and all columns
df.loc["Belgium",:]
df.loc[["Albania","Belgium"],:]

df
# Rows Albania, Belgium, Bulgaria; all columns
df.loc[["Albania","Belgium","Bulgaria"]]
print(df.loc['Albania':'Bulgaria'])
df.loc["Albania":"Bulgaria",]
df.loc["Albania":"Bulgaria"]


# Extent of Slicing
# Do the two statements below produce the same output?
# Based on this, what rule governs what is included (or not) in
# numerical slices and named slices in Pandas?
# .iloc lets you filter your dataframe by integer position, i.e. row index / column index
df.iloc[2,3]
print(df.iloc[0:2, 0:2])
print(df.iloc[0:2, [0,2]])
print(df.iloc[[0,2], [0,2]])
print(df.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])


# Modify DataFrame value: original value = 2312.888958
df.loc["Albania", "gdpPercap_1962"]
# Modify
df.loc["Albania", "gdpPercap_1962"] = 1000
# New value after
df.head(10)
df.loc["Albania", "gdpPercap_1962"]

# To retrieve just one single value, you can also use df.at or df.iat
df.at["Albania", "gdpPercap_1962"]
df.iat[2, 2]

# You can replace the index with the values of a column
# Assign a column to the index with DataFrame.index = DataFrame.ColumnName
df.index = df.gdpPercap_1962 # Attribute access works only when the column name is a valid Python identifier (e.g. a single word)
df.index = df["gdpPercap_1962"] # Bracket access also works when there's a space in the column name

# What conclusion can we draw?
# We see that a numerical slice, 0:2, omits the final index (i.e. index 2) in the range provided,
# while a named slice, 'gdpPercap_1952':'gdpPercap_1962', includes the final element.

# Reconstructing data:
# Go through the following piece of code line by line, then
# Explain what each line in the short program does: what is in first, second, etc.?
first = pd.read_csv('https://raw.githubusercontent.com/laitanawe/pandasds/main/workshops/data/gapminder_all.csv', index_col='country')
second = first[first['continent'] == 'Americas']
third = second.drop('Puerto Rico')
fourth = third.drop('continent', axis = 1)
fourth.to_csv('result.csv')

# The first line loads the dataset containing the GDP data from all countries into a dataframe called first.
# The index_col='country' parameter selects which column to use as the row labels in the dataframe.

# This second line makes a selection: only those rows of first for which
# the ‘continent’ column matches ‘Americas’ are extracted.
# Notice how the Boolean expression inside the brackets,
# first['continent'] == 'Americas', is used to select only those rows where the expression is true.
# Try printing this expression! Can you print also its individual True/False elements?
# (hint: first assign the expression to a variable)

# The third line: As the syntax suggests, this line drops the row from second where the label is ‘Puerto Rico’.
# The resulting dataframe third has one row less than the original dataframe second.

# The fourth line: Again we apply the drop function, but in this case we are dropping not a row but a
# whole column.
# To accomplish this, we also need to specify the axis parameter: axis=1 tells drop to look for
# the label 'continent' among the columns (axis 1) rather than among the row labels (axis 0, the default).

# The fifth line: The final step is to write the data that we have been working on to a csv file.
# Pandas makes this easy with the to_csv() function. The only required argument to the function is the filename.
# Note that the file will be written in the directory from which you started the Jupyter or Python session.
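
# A self-contained sketch of the same filter / drop / write steps.
# The CSV text below is made up; it only stands in for gapminder_all.csv.
import io
import pandas as pd  # repeated so this snippet also runs on its own
toy_csv = io.StringIO(
    "country,continent,gdp\n"
    "Canada,Americas,10\n"
    "Puerto Rico,Americas,20\n"
    "France,Europe,30\n"
)
toy_first = pd.read_csv(toy_csv, index_col="country")
toy_second = toy_first[toy_first["continent"] == "Americas"]  # keep only matching rows
toy_third = toy_second.drop("Puerto Rico")                    # drop a row by its label
toy_fourth = toy_third.drop("continent", axis=1)              # drop a column by its label
print(toy_fourth.to_csv())  # with no filename, to_csv returns the CSV text instead of writing a file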

# Selecting Indices
# Explain in simple terms what idxmin and idxmax do in the short program below. When would you use these methods?
data = pd.read_csv('https://raw.githubusercontent.com/laitanawe/pandasds/main/workshops/data/gapminder_gdp_europe.csv', index_col='country')
print(data.idxmin())
print(data.idxmax())
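
# A small sketch of what idxmin and idxmax return, on a made-up frame
# (toy_idx and its values are illustrative, not workshop data).
import pandas as pd  # repeated so this snippet also runs on its own
toy_idx = pd.DataFrame(
    {"y1": [3.0, 1.0, 2.0], "y2": [7.0, 9.0, 8.0]},
    index=["A", "B", "C"],
)
print(toy_idx.idxmin())  # row label of the smallest value in each column: y1 -> B, y2 -> A
print(toy_idx.idxmax())  # row label of the largest value in each column:  y1 -> A, y2 -> B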

# EXERCISE:
# Assume Pandas has been imported and the Gapminder GDP data for Europe has been loaded.
# Write an expression to select each of the following:

# GDP per capita for all countries in 1982.
# GDP per capita for Denmark for all years.
# GDP per capita for all countries for years after 1985.
# GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952.

data['gdpPercap_1982']
data.loc['Denmark',:]
data.loc[:,'gdpPercap_1985':]
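# The slice works even though no column named gdpPercap_1985 actually exists:
# because the column labels are in sorted order, Pandas can locate where that label would fall
# and start the slice at the next label, gdpPercap_1987. With unsorted labels, a missing label raises a KeyError.
# This is useful if new columns are added to the CSV file later.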
data['gdpPercap_2007']/data['gdpPercap_1952']
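
# A sketch of the open-ended label slice 'gdpPercap_1985': on a toy frame.
# It assumes, as in the Gapminder file, that the column labels are in sorted order;
# the single row of values is made up.
import pandas as pd  # repeated so this snippet also runs on its own
toy_years = pd.DataFrame(
    [[1, 2, 3]],
    columns=["gdpPercap_1982", "gdpPercap_1987", "gdpPercap_1992"],
)
after_1985 = toy_years.loc[:, "gdpPercap_1985":]   # the start label itself is not a column
print(toy_years.columns.is_monotonic_increasing)   # True: the labels are in sorted order
print(list(after_1985.columns))                    # ['gdpPercap_1987', 'gdpPercap_1992']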

# Using the dir function to see available methods
# Python includes a dir function that can be used to display all of
# the available methods (functions) that are built into a data object.
# As an example, the functions available for a list data type are:
potatoes = ["Russet", "Norkota", "Yukon Gold", "Pontiac"]
dir(potatoes)

# ['__add__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

# The double underscore functions can be ignored for now; functions that are
# not surrounded by double underscores are the public interface of the list type.
# So, if you want to sort the list of potatoes, according to dir you should try,
potatoes.sort()

# Assume Pandas has been imported and the Gapminder GDP data for Europe has been loaded as data.
# Then, use dir to find the function that prints out the median per-capita GDP
# across all European countries for each year that information is available.
# Among many choices, dir lists the median() function as a possibility. Thus,
data.median()
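
# A sketch of filtering dir() output down to the public names, on a made-up frame.
# The names toy_dir and public_names are illustrative assumptions.
import pandas as pd  # repeated so this snippet also runs on its own
toy_dir = pd.DataFrame({"y1": [1.0, 2.0, 9.0]})
public_names = [name for name in dir(toy_dir) if not name.startswith("_")]
print("median" in public_names)  # True
print(toy_dir.median())          # median of each column: y1 -> 2.0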


# Lessons:
# Use DataFrame.iloc[..., ...] to select values by integer location.
# Use : on its own to mean all columns or all rows.
# Select multiple columns or rows using DataFrame.loc and a named slice.
# Result of slicing can be used in further operations.
# Use comparisons to select data based on value.
# Select values or NaN using a Boolean mask.

# OTHER EXAMPLES:
# Print the first few lines of a dataframe using the head function
#print(store.head(15))

# Print the headers of a dataframe
#print(store.columns)

# Print a specific column of a dataframe (Indexing starts from zero in Python)
#print(store['median_income'])

# Print a specific column of a dataframe
# (Python skips the ending index when you slice)
#print(store['median_income'][3:10])

# Subset: only print a set of columns
# Read each column
#print(store[['longitude', 'latitude', 'housing_median_age', 'population']])

# Print specific rows of a dataframe using the iloc function
#print(store.iloc[1:6])


#print(data.loc["Albania", :])

# Would get the same result printing data.loc["Albania"] (without a second index).
#print(data.loc[:, "gdpPercap_1952"])

pandas version 2.1.4
1601.056136

1601.056136

gdpPercap_1952    1601.056136
gdpPercap_1957    1942.284244
gdpPercap_1962    2312.888958
gdpPercap_1967    2760.196931
gdpPercap_1972    3313.422188
gdpPercap_1977    3533.003910
gdpPercap_1982    3630.880722
gdpPercap_1987    3738.932735
gdpPercap_1992    2497.437901
gdpPercap_1997    3193.054604
gdpPercap_2002    4604.211737
gdpPercap_2007    5937.029526
Name: Albania, dtype: float64

gdpPercap_1952    1601.056136
gdpPercap_1957    1942.284244
gdpPercap_1962    2312.888958
gdpPercap_1967    2760.196931
gdpPercap_1972    3313.422188
gdpPercap_1977    3533.003910
gdpPercap_1982    3630.880722
gdpPercap_1987    3738.932735
gdpPercap_1992    2497.437901
gdpPercap_1997    3193.054604
gdpPercap_2002    4604.211737
gdpPercap_2007    5937.029526
Name: Albania, dtype: float64
country
Albania                    1601.056136
Austria                    6137.076492
Belgium                    8343.105127
Bosnia and Herzegovina      973.533195
Bulgaria

1000.0