<a href="https://colab.research.google.com/github/olivia-maras/olivia-maras/blob/main/Copy_of_2_LoadingAccessingAndExportingData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### About Jupyter Notebooks

This Google Colab notebook is an online version of a Jupyter Notebook, which uses a web browser as the interface for writing, running, and viewing the outputs of code. We will use Colab for the sake of consistency and convenience, but you can also [install and run the software on your computer](https://jupyter.org/) for free (if you want to do this work without an internet connection).

If you haven't used Jupyter Notebooks before, one thing to know ahead of time is that Jupyter Notebooks have two relevant types of "cells":

The text you are reading now is part of a text "markdown" cell. You can use this to write notes and contextual information about your code, which can also be lightly-formatted using the styles described here: https://www.markdownguide.org/cheat-sheet/

The cells below are code cells. When you hit the "play" button next to a code cell, it essentially does a combination of two things:

1. Runs the code
2. Prints out the "results"

Comments (descriptive text that is part of the code but ignored by the computer) in Python are indicated by the hash (#) symbol. You will need one at the start of every line of comments, or in front of any line of code you want ignored:
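As a small illustration (the variable names here are made up for this example), the cell below mixes whole-line comments, an end-of-line comment, and a commented-out line of code:

```python
# This whole line is a comment, so Python ignores it.
price = 4      # a comment can also follow code on the same line
quantity = 3

# quantity = 10   <- this line of code is "commented out", so it never runs

total = price * quantity
print(total)
```

Removing the `#` in front of `quantity = 10` would bring that line back to life and change the result.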

**Tip**: Commenting out code is a great way to quickly experiment with making changes to working code (or debugging code that isn't working), without losing your previous work! 

### Accessing and Assessing Data in Python with Pandas

If you're not already a little familiar with Python, hopefully you've watched the introductory videos shared before this course, which cover the basics of interacting with Python code, including data types, conditionals and looping. However, since our focus here is on working with data, that is where we'll spend the bulk of our time - specifically working with some of the excellent libraries that make Python such a powerful and versatile tool.

For example, while "raw" Python comes with built-in support for working with a range of file types, for the most part we will be working with the well-known **pandas** library, which can import a wide range of data file formats (everything from CSVs to Stata files), and has many useful methods for assessing and transforming that data.

In this exercise, we're going to use pandas to look at daily vaccination data from around the world. Along the way, we're also going to carefully document what we do, to make sure that when we revisit this file in the future, we know what's going on.

In [None]:
# first, import the pandas library, giving it a nickname of "pd" for short
import pandas as pd

### Quirks of Colab

Because we're running code online, the data we want to work with has to be online as well. There are a few different ways to access data in Colab, but the most straightforward (and reliable) way is to upload our data to Google Drive and use a little extra Python code to load it into our notebook from there. The next couple of cells demonstrate how this is done. We'll read through the comments together so you can understand how to do this with your own data sets.

In [None]:
# THIS CODE REQUIRED FOR GOOGLE COLAB
# Import PyDrive and associated libraries.
# This only needs to be done once per notebook.
# Documentation found here: https://colab.research.google.com/notebooks/io.ipynb#scrollTo=7taylj9wpsA2
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
# Link to data file stored in Drive: https://drive.google.com/file/d/10P11wZOuwlCN9krxokxmeryOK3mnMZby/view?usp=sharing
# This will need to be a file that you have access to via the same Google Account you're using to run Colab.
file_id = '10P11wZOuwlCN9krxokxmeryOK3mnMZby' # notice where this string comes from in link above

imported_file = drive.CreateFile({'id': file_id}) # creating a local holding place for the target data file stored in Drive
print(imported_file['title'])  # printing the title of the target file, so we can feel confident we've got the right one
imported_file.GetContentFile(imported_file['title']) # actually download the file's contents into the Colab runtime, saved under the same title

owid-covid-data.csv


#### Loading different file types

Most of the data we'll be working with here is formatted as comma-separated value (CSV) files, because it is such a common, efficient, open data format. Of course, since we don't always have a choice about what format our source data arrives in, we'll also experiment with some other sources to see how simply Python and pandas can load data in a variety of formats.

In [None]:
# our data is stored as a csv, so we'll use the `read_csv()` method.
# similar methods exist for other data formats, e.g. `read_excel()` or `read_stata()`
# for a complete list of these methods, see https://pandas.pydata.org/docs/reference/io.html

# we're going to take care to name our dataframe and other variables descriptively
# this will make reading our code later much more intuitive

vaccine_data = pd.read_csv('owid-covid-data.csv')


**Pandas converts whatever data it reads into a "dataframe." While dataframes can be used to perform most data operations that you would expect, the structure is sometimes slightly different than you might expect.**

### Getting a sense of our dataset

If we haven't already previewed a particular data set in another program (e.g. a text file), usually it's helpful to get a basic overview of what it contains. pandas has a range of useful methods for generating a quick overview of what's in our data.

In [None]:
# by default, pandas adds an "index" column to every data frame
# if we want to know the total number of rows with *any* data in them
# we can use the `len()` method to get the length of this index column

print(len(vaccine_data.index))

201270


In [None]:
# what if we want to know how many cells are in our data set?

# built in pandas method

print(vaccine_data.size)

#number of rows times number of columns
print(vaccine_data.shape[0]*vaccine_data.shape[1])

13485090
13485090


In [None]:
# knowing what our column headers are is usually important as well

print(vaccine_data.columns)

Index(['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases',
       'new_cases_smoothed', 'total_deaths', 'new_deaths',
       'new_deaths_smoothed', 'total_cases_per_million',
       'new_cases_per_million', 'new_cases_smoothed_per_million',
       'total_deaths_per_million', 'new_deaths_per_million',
       'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients',
       'icu_patients_per_million', 'hosp_patients',
       'hosp_patients_per_million', 'weekly_icu_admissions',
       'weekly_icu_admissions_per_million', 'weekly_hosp_admissions',
       'weekly_hosp_admissions_per_million', 'total_tests', 'new_tests',
       'total_tests_per_thousand', 'new_tests_per_thousand',
       'new_tests_smoothed', 'new_tests_smoothed_per_thousand',
       'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated', 'total_boosters',
       'new_vaccinations', 'new_vaccinations_smoothed',
       't

## Accessing subsets of data

There are all kinds of ways that we might want to access our data. We'll start here with ways to access data by row and/or column, then move on to how we can filter by e.g. a particular value. 

### Accessing data by position: `.iloc`

The two main tools for accessing data by row and column are the `.loc` and `.iloc` indexers.

You can think of these as "location by label" (`.loc`) and "integer location" (`.iloc`), which can help when trying to remember which one to use. We can _always_ use `.iloc`, because every row and column in a DataFrame has an integer position, whatever its label. This means we can use `.iloc` to get data by row, column, or any combination of the two. 

**Tip:** Remember here that Python is _zero-indexed_, and, in most cases, goes _up to but not including_ the second parameter provided when accessing data by index.



In [None]:
# get the first ten rows of data
# note that using the `.iloc` method here is not strictly required, but is helpful for readability and consistency

first_ten = vaccine_data.iloc[0:10]
print(first_ten)

  iso_code continent     location        date  total_cases  new_cases  \
0      AFG      Asia  Afghanistan  2020-02-24          5.0        5.0   
1      AFG      Asia  Afghanistan  2020-02-25          5.0        0.0   
2      AFG      Asia  Afghanistan  2020-02-26          5.0        0.0   
3      AFG      Asia  Afghanistan  2020-02-27          5.0        0.0   
4      AFG      Asia  Afghanistan  2020-02-28          5.0        0.0   
5      AFG      Asia  Afghanistan  2020-02-29          5.0        0.0   
6      AFG      Asia  Afghanistan  2020-03-01          5.0        0.0   
7      AFG      Asia  Afghanistan  2020-03-02          5.0        0.0   
8      AFG      Asia  Afghanistan  2020-03-03          5.0        0.0   
9      AFG      Asia  Afghanistan  2020-03-04          5.0        0.0   

   new_cases_smoothed  total_deaths  new_deaths  new_deaths_smoothed  ...  \
0                 NaN           NaN         NaN                  NaN  ...   
1                 NaN           NaN       

In [None]:
# get last ten rows of data using `.iloc`

last_ten = vaccine_data.iloc[-10:]

print(last_ten)

       iso_code continent  location        date  total_cases  new_cases  \
201260      ZWE    Africa  Zimbabwe  2022-07-05     255755.0       29.0   
201261      ZWE    Africa  Zimbabwe  2022-07-06     255805.0       50.0   
201262      ZWE    Africa  Zimbabwe  2022-07-07     255805.0        0.0   
201263      ZWE    Africa  Zimbabwe  2022-07-08     255891.0       86.0   
201264      ZWE    Africa  Zimbabwe  2022-07-09     255924.0       33.0   
201265      ZWE    Africa  Zimbabwe  2022-07-10     255939.0       15.0   
201266      ZWE    Africa  Zimbabwe  2022-07-11     255953.0       14.0   
201267      ZWE    Africa  Zimbabwe  2022-07-12     255981.0       28.0   
201268      ZWE    Africa  Zimbabwe  2022-07-13     255981.0        0.0   
201269      ZWE    Africa  Zimbabwe  2022-07-14     256047.0       66.0   

        new_cases_smoothed  total_deaths  new_deaths  new_deaths_smoothed  \
201260              53.143        5558.0         0.0                1.286   
201261              

In [None]:
# get a subset of rows and columns using integer-based indexing with `iloc`: 
# iloc[row index range, column index range]
# note that the pandas-added index is NOT counted as a column: column 0 is `iso_code`

some_subset = vaccine_data.iloc[0:3,0:4]
print(some_subset)

  iso_code continent     location        date
0      AFG      Asia  Afghanistan  2020-02-24
1      AFG      Asia  Afghanistan  2020-02-25
2      AFG      Asia  Afghanistan  2020-02-26


In [None]:
# get all rows and subset of columns using integer-based indexing with `iloc`: 
# iloc[row index range, column index range]
# note that the pandas-added index is NOT counted as a column: column 0 is `iso_code`

four_cols = vaccine_data.iloc[:,0:4]

print(four_cols)


       iso_code continent     location        date
0           AFG      Asia  Afghanistan  2020-02-24
1           AFG      Asia  Afghanistan  2020-02-25
2           AFG      Asia  Afghanistan  2020-02-26
3           AFG      Asia  Afghanistan  2020-02-27
4           AFG      Asia  Afghanistan  2020-02-28
...         ...       ...          ...         ...
201265      ZWE    Africa     Zimbabwe  2022-07-10
201266      ZWE    Africa     Zimbabwe  2022-07-11
201267      ZWE    Africa     Zimbabwe  2022-07-12
201268      ZWE    Africa     Zimbabwe  2022-07-13
201269      ZWE    Africa     Zimbabwe  2022-07-14

[201270 rows x 4 columns]


**Tip:** The format for index/positional access is \[_index_ : _index_ \]. In pandas, if only one argument is given, it's always interpreted as rows. If either value is absent, it's interpreted as "from the beginning" or "until the end."

+ \[ : 20\] becomes 0 through 19 (up to but not including 20)
+ \[ 10 : \] becomes 10 until the end/highest index
+ \[ : \] becomes "beginning to end", that is all rows or columns, depending on placement
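The slicing rules above can be tried out on a tiny DataFrame. This is only a sketch: the `toy` data and its column names are invented for illustration and are not part of the vaccination data set.

```python
import pandas as pd

# toy data, invented purely for illustration
toy = pd.DataFrame({
    "letter": ["a", "b", "c", "d", "e"],
    "number": [10, 20, 30, 40, 50],
})

first_two = toy.iloc[:2]        # rows 0 and 1 (stops before position 2)
from_three = toy.iloc[3:]       # rows 3 and 4 (position 3 until the end)
everything = toy.iloc[:]        # all rows, all columns
one_column = toy.iloc[:, 0:1]   # all rows, first column only

print(first_two)
print(from_three)
print(everything.shape)
print(one_column.shape)
```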

### Accessing data by label: `.loc`

Because pandas adds a numerical index column that effectively labels each row, we can also access data by a combination of row number and column header using the `.loc` method.

To do this, of course, it would first be helpful to recall what our column headers are.



In [None]:
# print out the column headers of our data

print(vaccine_data.columns)


In [None]:
# get a subset of rows **and** columns using label-based indexing with `loc`: 
# loc[row label range, column label list]
# note that, unlike `iloc`, a `loc` range *includes* its end label, so 0:3 returns four rows
# note that the pandas-added index column is shown even though it is not a true data column

a_different_dozen = vaccine_data.loc[0:3, ['iso_code','continent','location','date']]

print(a_different_dozen)

  iso_code continent     location        date
0      AFG      Asia  Afghanistan  2020-02-24
1      AFG      Asia  Afghanistan  2020-02-25
2      AFG      Asia  Afghanistan  2020-02-26
3      AFG      Asia  Afghanistan  2020-02-27
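
The output above has four rows because a `.loc` range includes its end label, whereas an `.iloc` range stops just before its end position. A minimal sketch of the difference, using an invented `toy` DataFrame whose row labels (10, 20, 30, 40) deliberately differ from its row positions (0, 1, 2, 3):

```python
import pandas as pd

# toy data with custom row labels, invented purely for illustration
toy = pd.DataFrame(
    {"city": ["Lima", "Oslo", "Cairo", "Hanoi"], "visits": [3, 7, 1, 4]},
    index=[10, 20, 30, 40],
)

# labels 10, 20 AND 30: the end label is included
by_label = toy.loc[10:30, ["city"]]

# positions 0, 1 and 2: stops before position 3
by_position = toy.iloc[0:3, 0:1]

print(by_label)
print(by_position)
```

Both selections return the same three rows, even though the numbers inside the brackets look quite different.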


In [None]:
# get ALL columns for the first and third rows
# passing a list of row labels selects exactly those rows

just_two_rows = vaccine_data.loc[[0, 2]]
print(just_two_rows)

### Accessing subsets of data: "slicing" by cell value

The most readable (and reusable) way to select, for example, all rows with a particular column value, is to use two steps:

1. Assign the condition we want the returned rows to meet (e.g. the value in the `iso_code` column is `MEX`) to a well-named variable
2. Use that variable as the selector from the original dataset

This process will create a "slice" of our original DataFrame that we can use for additional operations and analysis.


**Let's look at all the data for Mexico.**


In [None]:
# create a variable that describes the data values we are interested in

just_MEX = vaccine_data['iso_code'] == 'MEX'

In [None]:
# create a new variable that will hold the "slice" that meets our defined condition

MEX_data = vaccine_data[just_MEX]

print(MEX_data)

       iso_code      continent location        date  total_cases  new_cases  \
116800      MEX  North America   Mexico  2020-01-01          NaN        NaN   
116801      MEX  North America   Mexico  2020-01-02          NaN        NaN   
116802      MEX  North America   Mexico  2020-01-03          NaN        NaN   
116803      MEX  North America   Mexico  2020-01-04          NaN        NaN   
116804      MEX  North America   Mexico  2020-01-05          NaN        NaN   
...         ...            ...      ...         ...          ...        ...   
117721      MEX  North America   Mexico  2022-07-10    6259325.0     9342.0   
117722      MEX  North America   Mexico  2022-07-11    6265311.0     5986.0   
117723      MEX  North America   Mexico  2022-07-12    6301645.0    36334.0   
117724      MEX  North America   Mexico  2022-07-13    6338991.0    37346.0   
117725      MEX  North America   Mexico  2022-07-14    6373876.0    34885.0   

        new_cases_smoothed  total_deaths  new_death

**Tip:** It's always a good idea to confirm the data type of e.g. a particular column before trying to create a filter variable for it, or you may get surprising results!

In [None]:
# show the current data type for every column in the data set
# note that pandas truncates long output when we `print`, showing only the first and last few rows!

print(vaccine_data.dtypes)


iso_code                                    object
continent                                   object
location                                    object
date                                        object
total_cases                                float64
                                            ...   
human_development_index                    float64
excess_mortality_cumulative_absolute       float64
excess_mortality_cumulative                float64
excess_mortality                           float64
excess_mortality_cumulative_per_million    float64
Length: 67, dtype: object


In [None]:
# now we know, for example, that the `date` column is actually an `object` data type,
# so if we want to filter by date, we'll most likely need to make our filter a string

start_of_2020 = vaccine_data['date'] == '2020-01-01'

print(vaccine_data[start_of_2020])

       iso_code      continent   location        date  total_cases  new_cases  \
6902        ARG  South America  Argentina  2020-01-01          NaN        NaN   
116800      MEX  North America     Mexico  2020-01-01          NaN        NaN   

        new_cases_smoothed  total_deaths  new_deaths  new_deaths_smoothed  \
6902                   NaN           NaN         NaN                  NaN   
116800                 NaN           NaN         NaN                  NaN   

        ...  female_smokers  male_smokers  handwashing_facilities  \
6902    ...            16.2          27.7                     NaN   
116800  ...             6.9          21.4                  87.847   

        hospital_beds_per_thousand  life_expectancy  human_development_index  \
6902                          5.00            76.67                    0.845   
116800                        1.38            75.05                    0.779   

        excess_mortality_cumulative_absolute  excess_mortality_cumulative  
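
Matching a string works for a single day, but a date *range* is easier once the column holds real dates. One possible approach, sketched here on invented `toy` data rather than the real data set, is to convert the column with `pd.to_datetime()` and then compare against `pd.Timestamp` values:

```python
import pandas as pd

# toy data, invented purely for illustration
toy = pd.DataFrame({
    "date": ["2020-01-01", "2020-06-15", "2021-05-01"],
    "new_cases": [0, 120, 45],
})

# convert the text dates into real datetime values
toy["date"] = pd.to_datetime(toy["date"])

# with datetime values we can filter on a range, not just an exact match
in_2020 = (toy["date"] >= pd.Timestamp("2020-01-01")) & (toy["date"] < pd.Timestamp("2021-01-01"))

print(toy.dtypes)
print(toy[in_2020])
```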

### Meeting multiple conditions

A key advantage of creating variables to hold our various filters is that it allows us to write more reusable - and readable - code, especially when we want to find data that meets multiple conditions. Compare the following two ways to identify locations with at least 200,000 COVID deaths reported as of May 1, 2021.

In [None]:
# selecting the data using a compound conditional directly

over_200K_by_0521_all_one_statement = vaccine_data[(vaccine_data['date']=='2021-05-01') & (vaccine_data['total_deaths'] >= 200000)]

print(over_200K_by_0521_all_one_statement)

        iso_code      continent             location        date  total_cases  \
10040   OWID_ASI            NaN                 Asia  2021-05-01   40032299.0   
25533        BRA  South America               Brazil  2021-05-01   14733396.0   
58694   OWID_EUR            NaN               Europe  2021-05-01   45278086.0   
59598   OWID_EUN            NaN       European Union  2021-05-01   31167352.0   
79149   OWID_HIC            NaN          High income  2021-05-01   73622287.0   
83538        IND           Asia                India  2021-05-01   19557457.0   
106931  OWID_LMC            NaN  Lower middle income  2021-05-01   32299117.0   
117286       MEX  North America               Mexico  2021-05-01    2347780.0   
133216  OWID_NAM            NaN        North America  2021-05-01   37728161.0   
167961  OWID_SAM            NaN        South America  2021-05-01   25004014.0   
190109       USA  North America        United States  2021-05-01   32500398.0   
191832  OWID_UMC            

In [None]:
# create a variable for each condition

# we want the date column to be May 1st 2021
# (note the double `==` for comparison; a single `=` would overwrite the column!)

may_first = vaccine_data['date'] == '2021-05-01'

# total deaths column greater than or equal to 200K

over_200k = vaccine_data['total_deaths'] >= 200000

# now much more readable

over_200K_by_0521 = vaccine_data[may_first & over_200k]

print(over_200K_by_0521)
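
Once each condition lives in its own variable, the same variables can be recombined in other ways: `&` means "and", `|` means "or", and `~` means "not". The sketch below uses an invented `toy` DataFrame (the locations and numbers are made up) to show all three:

```python
import pandas as pd

# toy data, invented purely for illustration
toy = pd.DataFrame({
    "location": ["A", "B", "C", "D"],
    "date": ["2021-05-01", "2021-05-01", "2021-05-02", "2021-05-01"],
    "total_deaths": [250000.0, 1200.0, 300000.0, None],
})

on_may_first = toy["date"] == "2021-05-01"
over_200k = toy["total_deaths"] >= 200000

both = toy[on_may_first & over_200k]               # rows meeting both conditions
either = toy[on_may_first | over_200k]             # rows meeting at least one condition
may_first_not_over = toy[on_may_first & ~over_200k]  # May 1st rows that are NOT over 200K

print(both["location"].tolist())
print(either["location"].tolist())
print(may_first_not_over["location"].tolist())
```

Notice that location D, which has a missing `total_deaths` value, lands in the "not over 200K" group: a comparison against a missing value is treated as `False`, so negating it gives `True`.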



## Trip-ups and gotchas

As powerful as it is, working with pandas (and with Colab) can create some surprising/unpredictable situations. We'll review a few common ones here.

+ needing additional libraries, e.g. for SPSS files
+ needing an additional import in order to download data

### Libraries that Colab doesn't have

Colab has a **lot** of libraries installed by default, but that doesn't mean it will have every single one you need. Let's see what happens when we try to load the following SPSS `.sav` format file and read it into a pandas DataFrame.

In [None]:
# THIS CODE REQUIRED FOR GOOGLE COLAB
# Link to data file stored in Drive: https://drive.google.com/file/d/1JRkluXYs81IThAG7XFH0ng6Emi67H8oD/view?usp=sharing
# This will need to be a file that you have access to via the same Google Account you're using to run Colab.
sav_file_id = '1JRkluXYs81IThAG7XFH0ng6Emi67H8oD' # notice where this string comes from in link above

imported_sav_file = drive.CreateFile({'id': sav_file_id}) # creating a local holding place for the target data file stored in Drive
print(imported_sav_file['title'])  # printing the title of the target file, so we can feel confident we've got the right one
imported_sav_file.GetContentFile(imported_sav_file['title']) # actually download the file's contents into the Colab runtime, saved under the same title

In [None]:
# load our data

Fortunately, the error message here tells us most of what we need to know: both that we're missing a particular library, and the basics of how to get it. We'll use the recommended `!pip` method to install the library mentioned and try again.
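
If you would rather check for a library before the error appears, something like the following sketch could work. Two assumptions to flag: `is_installed` is a helper invented for this example, and `pyreadstat` is assumed to be the library named in the error message (it is the package pandas relies on for `read_spss()`), so substitute whatever name your own error reports.

```python
import importlib.util

def is_installed(package_name):
    """Hypothetical helper: True if Python can find the named package."""
    return importlib.util.find_spec(package_name) is not None

package = "pyreadstat"  # assumption: the library named in the error message

if is_installed(package):
    print(f"{package} is available, so reading the file should work")
else:
    # in Colab, a line starting with `!` runs as a shell command, not Python
    print(f"{package} is missing; in a Colab cell, run: !pip install {package}")
```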

In [None]:
# let's try parsing that data with pandas one more time...

Success! This gives us a basic outline for loading data from basically any type of file, like `.xls` and `.dta` files.

https://drive.google.com/file/d/1YMTaM4a9aU3iqaAvd_ntpYy1NLyB8L6-/view?usp=sharing

Let's start with an `.xls` file

In [None]:
# THIS CODE REQUIRED FOR GOOGLE COLAB
# Link to data file stored in Drive: https://docs.google.com/spreadsheets/d/1k7aW8MG3YaoryuzcPbGnJ-4pb7j_5sS4/edit?usp=sharing
# This will need to be a file that you have access to via the same Google Account you're using to run Colab.
xls_file_id = '1k7aW8MG3YaoryuzcPbGnJ-4pb7j_5sS4' # notice where this string comes from in link above

imported_xls_file = drive.CreateFile({'id': xls_file_id}) # creating a local holding place for the target data file stored in Drive
print(imported_xls_file['title'])  # printing the title of the target file, so we can feel confident we've got the right one
imported_xls_file.GetContentFile(imported_xls_file['title']) # actually download the file's contents into the Colab runtime, saved under the same title

In [None]:
# let's try another data format

In [None]:
# let's try that again

Now we'll do a `.dta` (Stata) file.

In [None]:
# THIS CODE REQUIRED FOR GOOGLE COLAB
# Link to data file stored in Drive: https://drive.google.com/file/d/1YMTaM4a9aU3iqaAvd_ntpYy1NLyB8L6-/view?usp=sharing
# This will need to be a file that you have access to via the same Google Account you're using to run Colab.
dta_file_id = '1YMTaM4a9aU3iqaAvd_ntpYy1NLyB8L6-' # notice where this string comes from in link above

imported_dta_file = drive.CreateFile({'id': dta_file_id}) # creating a local holding place for the target data file stored in Drive
print(imported_dta_file['title'])  # printing the title of the target file, so we can feel confident we've got the right one
imported_dta_file.GetContentFile(imported_dta_file['title']) # actually download the file's contents into the Colab runtime, saved under the same title

In [None]:
# will this work "out of the box?"

## Downloading your data

Whether you just want to transform a file into a friendlier format or output a new, filtered file to work with, exporting your data as a csv is pretty easy. We will need to load up one more Colab-specific library, but then we can download whatever we want (as long as it still exists in this runtime)!

In [None]:
from google.colab import files

# open a writable file and write our converted data to it

In [None]:
# note that we can do this for any DataFrame we've defined
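
One way the export step might look is sketched below. The `toy` DataFrame and the `toy_export.csv` file name are invented for illustration; in practice you would swap in any DataFrame defined earlier. The `try`/`except` is only there so that the same cell also runs outside of Colab.

```python
import pandas as pd

# toy data, invented purely for illustration
toy = pd.DataFrame({
    "location": ["Mexico", "Brazil"],
    "total_deaths": [1.0, 2.0],
})

output_name = "toy_export.csv"  # hypothetical file name

# write the DataFrame to a csv file in the current runtime
# index=False leaves the pandas-added index out of the file
toy.to_csv(output_name, index=False)

try:
    # Colab-specific: ask the browser to download the file we just wrote
    from google.colab import files
    files.download(output_name)
except ImportError:
    print(f"Not running in Colab; {output_name} was saved locally instead.")
```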