<a href="https://colab.research.google.com/github/olivia-maras/olivia-maras/blob/main/Copy_of_5_ImputingAndUpdatingDataValues.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Imputing and Updating Data Values in Pandas

As we just saw, sometimes it can be tricky to tell just when we've made a change to an underlying data set, and when we're working with a view or a copy. 

While it may seem silly (why not just copy everything?), when you're working with very large data sets, there are a lot of efficiencies to be gained by *not* automatically copying all your data every time you decide you want to look at a different subset of it. As a result, however, it can be easy to run into problems where you can't quite seem to do what you want.

That's because, for data coding or analysis, we often want to transform our data by recoding or recasting values, creating new columns, and so on.

In general, the safest way to proceed is to either:

1. Perform any true data transformation on your entire data set, or
2. Explicitly create a copy of a DataFrame before trying to update any values


Bear in mind that the second method, in particular, can start to become memory-intensive. If you really need to "break off" big chunks of your data, consider downloading a separate file and focusing that work in a dedicated notebook.
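
As a minimal sketch of the second option, the example below uses a small made-up DataFrame (the names `cases` and `aus_only` and all values are hypothetical, not part of the course data). It takes an explicit copy of a subset before updating it, so the original is guaranteed to stay untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the course data set
cases = pd.DataFrame({
    'iso_code': ['AUS', 'AUS', 'NZL'],
    'new_cases': [10.0, 0.0, 3.0],
})

# Explicit copy: changes made to `aus_only` cannot reach `cases`
aus_only = cases[cases['iso_code'] == 'AUS'].copy()

# Recode zeros as missing values in the copy only
aus_only.loc[aus_only['new_cases'] == 0, 'new_cases'] = np.nan

print(aus_only)
print(cases)  # the original rows are unchanged
```

Without the `.copy()`, some pandas versions may emit a `SettingWithCopyWarning` on an update like this, which is usually a sign that it is unclear whether a view or a copy is being changed.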

In [None]:
# first, import the pandas library, giving it a nickname of "pd" for short
import pandas as pd
import numpy as np

In [None]:
# THIS CODE REQUIRED FOR GOOGLE COLAB
# Import PyDrive and associated libraries.
# This only needs to be done once per notebook.
# Documentation found here: https://colab.research.google.com/notebooks/io.ipynb#scrollTo=7taylj9wpsA2
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
# THIS CODE REQUIRED FOR GOOGLE COLAB
# Link to data file stored in Drive: https://drive.google.com/file/d/10P11wZOuwlCN9krxokxmeryOK3mnMZby/view?usp=sharing
file_id = '10P11wZOuwlCN9krxokxmeryOK3mnMZby' # notice where this string comes from in link above

imported_file = drive.CreateFile({'id': file_id}) # creating an accessible copy of the shared data file
print(imported_file['title'])  # it should print the title of desired file
imported_file.GetContentFile(imported_file['title']) # refer to it in this notebook by the same name as it has in Drive

owid-covid-data.csv


In [None]:
# our data is stored as a csv, so we'll use the `read_csv()` method.
# similar methods exist for other data formats, e.g. `read_excel()` or `read_stata()`
# for a complete list of these methods, see https://pandas.pydata.org/docs/reference/io.html

# we're going to take care to name our dataframe and other variables descriptively
# this will make reading our code later much more intuitive
vaccine_data = pd.read_csv('owid-covid-data.csv')

In [None]:
# unemployment data
# Link to data file stored in Drive: https://drive.google.com/file/d/1JqF_AqulFQCeVn1eLCh_mHmBAI7Bw6WL/view?usp=sharing
file_id = '1JqF_AqulFQCeVn1eLCh_mHmBAI7Bw6WL' # notice where this string comes from in link above

imported_file = drive.CreateFile({'id': file_id}) # creating an accessible copy of the shared data file
print(imported_file['title'])  # it should print the title of desired file
imported_file.GetContentFile(imported_file['title']) # refer to it in this notebook by the same name as it has in Drive

wb_employment_data.csv


In [None]:
# now, let's bring in some employment data and see what it contains

unemp_data = pd.read_csv('wb_employment_data.csv')
print(unemp_data.head())

  Country Name Country Code  \
0        Aruba          ABW   
1  Afghanistan          AFG   
2       Angola          AGO   
3      Albania          ALB   
4      Andorra          AND   

                                      Indicator Name  Indicator Code  1960  \
0  Unemployment, total (% of total labor force) (...  SL.UEM.TOTL.ZS   NaN   
1  Unemployment, total (% of total labor force) (...  SL.UEM.TOTL.ZS   NaN   
2  Unemployment, total (% of total labor force) (...  SL.UEM.TOTL.ZS   NaN   
3  Unemployment, total (% of total labor force) (...  SL.UEM.TOTL.ZS   NaN   
4  Unemployment, total (% of total labor force) (...  SL.UEM.TOTL.ZS   NaN   

   1961  1962  1963  1964  1965  ...   2011   2012   2013       2014  \
0   NaN   NaN   NaN   NaN   NaN  ...    NaN    NaN    NaN        NaN   
1   NaN   NaN   NaN   NaN   NaN  ...  11.51  11.52  11.54  11.450000   
2   NaN   NaN   NaN   NaN   NaN  ...   7.36   7.37   7.38   7.310000   
3   NaN   NaN   NaN   NaN   NaN  ...  13.48  13.38  15.8

In [None]:
# and let's remind ourselves of the columns in our vaccine_data
print(vaccine_data.head())

  iso_code continent     location        date  total_cases  new_cases  \
0      AFG      Asia  Afghanistan  2020-02-24          5.0        5.0   
1      AFG      Asia  Afghanistan  2020-02-25          5.0        0.0   
2      AFG      Asia  Afghanistan  2020-02-26          5.0        0.0   
3      AFG      Asia  Afghanistan  2020-02-27          5.0        0.0   
4      AFG      Asia  Afghanistan  2020-02-28          5.0        0.0   

   new_cases_smoothed  total_deaths  new_deaths  new_deaths_smoothed  ...  \
0                 NaN           NaN         NaN                  NaN  ...   
1                 NaN           NaN         NaN                  NaN  ...   
2                 NaN           NaN         NaN                  NaN  ...   
3                 NaN           NaN         NaN                  NaN  ...   
4                 NaN           NaN         NaN                  NaN  ...   

   female_smokers  male_smokers  handwashing_facilities  \
0             NaN           NaN        

In [None]:
# since unemployment is a "lagging" indicator, we're probably not as interested in the unemployment rate in 
# 2020 as the unemployment rate in 2019: if unemployment was high then, it may be a better predictor of a 
# country's economic state in 2020

# first, let's get *just* the unemployment rates for 2019

unemp_2019 = unemp_data.loc[:,['Country Code','2019']]
print(unemp_2019)


    Country Code       2019
0            ABW        NaN
1            AFG  10.980000
2            AGO   6.930000
3            ALB  11.470000
4            AND        NaN
..           ...        ...
259          XKX        NaN
260          YEM  12.900000
261          ZAF  28.469999
262          ZMB  11.910000
263          ZWE   5.020000

[264 rows x 2 columns]


In [None]:
# now, let's merge the unemployment data with our vaccine data

new_data = vaccine_data.merge(unemp_2019, how='left', left_on='iso_code', right_on='Country Code')
print(new_data.head().transpose())


                                                   0            1  \
iso_code                                         AFG          AFG   
continent                                       Asia         Asia   
location                                 Afghanistan  Afghanistan   
date                                      2020-02-24   2020-02-25   
total_cases                                      5.0          5.0   
...                                              ...          ...   
excess_mortality_cumulative                      NaN          NaN   
excess_mortality                                 NaN          NaN   
excess_mortality_cumulative_per_million          NaN          NaN   
Country Code                                     AFG          AFG   
2019                                           10.98        10.98   

                                                   2            3            4  
iso_code                                         AFG          AFG          AFG  
continent
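
To make the behaviour of a left merge easier to see, here is a small sketch on made-up tables (the `daily` and `rates` frames, the `XXX` code, and all values are assumptions for illustration). It uses the same `how`, `left_on`, and `right_on` arguments as above, plus the optional `indicator=True` flag, which adds a `_merge` column showing whether each row found a match.

```python
import pandas as pd

# Hypothetical toy tables standing in for the two data sets; values are made up
daily = pd.DataFrame({
    'iso_code': ['AFG', 'AFG', 'AUS', 'XXX'],
    'date': ['2020-02-24', '2020-02-25', '2020-01-26', '2020-01-01'],
})
rates = pd.DataFrame({
    'Country Code': ['AFG', 'AUS'],
    '2019': [11.0, 5.0],
})

# A left merge keeps every row of `daily`; unmatched rows get NaN on the right
merged = daily.merge(
    rates,
    how='left',
    left_on='iso_code',
    right_on='Country Code',
    indicator=True,  # adds a `_merge` column: 'both' or 'left_only'
)

print(merged)
print(merged['_merge'].value_counts())
```

Checking `_merge` after a merge is a quick way to spot country codes that exist in one file but not the other.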

In [None]:
# of course, now we have essentially a duplicate column, so let's get rid of that
# note that in pandas, if there is an "axis" option, 0=rows and 1=columns; 0 is usually the default
# there is also the more intuitive option: new_data.drop(columns=['Country Code'])

trimmed_data = new_data.drop(['Country Code'], axis=1)
print(trimmed_data.head().transpose())


                                                   0            1  \
iso_code                                         AFG          AFG   
continent                                       Asia         Asia   
location                                 Afghanistan  Afghanistan   
date                                      2020-02-24   2020-02-25   
total_cases                                      5.0          5.0   
...                                              ...          ...   
excess_mortality_cumulative_absolute             NaN          NaN   
excess_mortality_cumulative                      NaN          NaN   
excess_mortality                                 NaN          NaN   
excess_mortality_cumulative_per_million          NaN          NaN   
2019                                           10.98        10.98   

                                                   2            3            4  
iso_code                                         AFG          AFG          AFG  
continent

In [None]:
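# let's say we want to treat "total_cases_per_million" values at or below 1% of the population
# (that is, 10,000 cases per million) as NaN

# we're going to use the `.loc` indexer, which takes a row selector and a column selector:
# - the first part (before the comma) is a condition that identifies the rows to update
# - the second part (after the comma) is the column whose values should change
# - the value to the right of the `=` is the new value those cells should have (here, np.nan)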

trimmed_data.loc[trimmed_data['total_cases_per_million'] <= 10000, 'total_cases_per_million'] = np.nan

AUS_filter = trimmed_data['iso_code'] =='AUS'

just_AUS = trimmed_data[AUS_filter]

print(just_AUS[['date', 'total_cases_per_million']])



             date  total_cases_per_million
10480  2020-01-26                      NaN
10481  2020-01-27                      NaN
10482  2020-01-28                      NaN
10483  2020-01-29                      NaN
10484  2020-01-30                      NaN
...           ...                      ...
11376  2022-07-10               328568.256
11377  2022-07-11               330293.646
11378  2022-07-12               332007.116
11379  2022-07-13               333823.976
11380  2022-07-14               335319.631

[901 rows x 2 columns]
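
The same pattern can be tried on a tiny made-up DataFrame, where it is easy to see exactly which cells change. This is only a sketch: the `df` frame, the `low` mask, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up
df = pd.DataFrame({
    'iso_code': ['AAA', 'AAA', 'BBB'],
    'total_cases_per_million': [250.0, 10000.0, 42000.0],
})

# Boolean mask: True for every row at or below the threshold
low = df['total_cases_per_million'] <= 10000

# Row selector first, column label second, new value on the right
df.loc[low, 'total_cases_per_million'] = np.nan

print(df)
```

Storing the condition in a named variable such as `low` keeps the `.loc` line short and lets you inspect or reuse the mask before committing to the update.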


### More ways to create new DataFrames

Sometimes the idea of "cutting down" a large DataFrame is tedious, especially if there are many columns and you don't want to mess with lots of sub-selecting. Recall that you can always convert a DataFrame column to a list, and lists can be easily "zipped up" into new DataFrames!

Likewise, if all you have are lists, this approach is handy for getting them into a more usable DataFrame format.

What if I want to make a new DataFrame from just a few columns from this existing one? Rather than dropping lots of columns, I can pull out the ones I want into lists and then `zip` them up into a new DataFrame.
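
For the case where all you have are lists, a minimal sketch might look like the following. The list names, column names, and values are made up for illustration.

```python
import pandas as pd

# Hypothetical lists of equal length; values are made up
dates = ['2020-01-26', '2020-01-27', '2020-01-28']
totals = [4, 5, 5]
new_cases = [4, 1, 0]

# zip() pairs up the i-th item of each list, so each tuple becomes one row
from_lists = pd.DataFrame(
    list(zip(dates, totals, new_cases)),
    columns=['my_date', 'my_total', 'my_new_cases'],
)

print(from_lists)
```

Note that `zip` stops at the shortest list, so lists of unequal length would silently lose rows; it is worth checking the lengths first.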

In [None]:
# pull out date, total_cases, new_cases, and new_deaths
# the `tolist()` method converts each column to a plain Python list

date_list = just_AUS['date'].tolist()
total_list = just_AUS['total_cases'].tolist()
new_cases_list = just_AUS['new_cases'].tolist()
# columns are selected with square brackets; a DataFrame cannot be called like a function
new_deaths_list = just_AUS['new_deaths'].tolist()

# zip them up into a new DataFrame

AUS_from_lists = pd.DataFrame(
    zip(date_list, total_list, new_cases_list, new_deaths_list),
    columns=['my_date', 'my_total', 'my_new_cases', 'my_new_deaths'],
)

### Creating new columns from existing data

Sometimes we might want to create a new column based on an existing column. For example, I might want to compile a sort of "sum" column from two existing columns of data. pandas also makes this easy, especially through the use of `lambda` (unnamed) functions, as shown below.

In [None]:
# a `lambda` function is an unnamed function defined inline, right where it is used (for example,
# inside a call to `apply()`). This is handy if you only want to do something very basic to a
# column, as lambda functions can only contain a single expression
# (e.g. `row.new_cases + row.new_deaths`)
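
One way this could look in practice is sketched below on a small made-up DataFrame. The column name `cases_plus_deaths` and the values are hypothetical; `apply` with `axis=1` passes each row to the lambda in turn.

```python
import pandas as pd

# Hypothetical toy data; values are made up
df = pd.DataFrame({
    'new_cases': [4.0, 1.0, 0.0],
    'new_deaths': [0.0, 1.0, 0.0],
})

# axis=1 means "apply the function to each row" rather than to each column
df['cases_plus_deaths'] = df.apply(
    lambda row: row.new_cases + row.new_deaths,
    axis=1,
)

print(df)
```

For a plain sum like this, `df['new_cases'] + df['new_deaths']` gives the same result and is typically faster; the lambda form becomes more useful when the per-row expression is less regular.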


In [None]:
# however, we can also do the same thing with a named function
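
A sketch of the named-function version follows, again on made-up data. The function name `add_cases_and_deaths` and the choice to count missing deaths as zero are assumptions for illustration; the point is that a named function can hold several statements, which a lambda cannot.

```python
import numpy as np
import pandas as pd


def add_cases_and_deaths(row):
    """Return new cases plus new deaths for one row, counting missing deaths as 0."""
    deaths = row['new_deaths']
    if pd.isna(deaths):
        deaths = 0.0
    return row['new_cases'] + deaths


# Hypothetical toy data; values are made up
df = pd.DataFrame({
    'new_cases': [4.0, 1.0, 0.0],
    'new_deaths': [0.0, np.nan, 2.0],
})

# Pass the function itself (no parentheses) so pandas can call it once per row
df['cases_plus_deaths'] = df.apply(add_cases_and_deaths, axis=1)

print(df)
```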
