# Class 5: Tutorial --- Reading Files Directly From GitHub

In the exercise for class 5, we downloaded a lot of files from GitHub and saved them locally on our computers. 

However, we can avoid this step to save time. Instead, we can read data directly from GitHub. 

This short tutorial illustrates how. 

## Importing Modules

We start by importing the necessary modules. 

The `tqdm` module is not strictly necessary, but is a nice way to generate progress bars in loops. 

In [None]:
import pandas as pd

from tqdm import tqdm

Next, we specify the files that we want to read and store their names in a list. 
Note that the file extension (`.csv`) is not included yet. We add it later, when we build the URL for each file.

In [None]:
# Generate file ids
files = ['20001', 
         '20011',
         '20012',
         '20021',
         '20031',
         '20041',
         '20042',
         '20051',
         '20061',
         '20071',
         '20072',
         '20081',
         '20091',
         '20101',
         '20102',
         '20111',
         '20121',
         '20131',
         '20141',
         '20142',
         '20151',
         '20161',
         '20171',
         '20181',
         '20182',
         '20191',
         '20201',
         '20211']

We then specify the base URL where the files are located. Raw files on GitHub are served from *https://raw.githubusercontent.com/USERNAME/REPO/master*, where *USERNAME* is the username of the GitHub account, *REPO* is the name of the repo, and *master* is the branch. 

In this case, *USERNAME*=mraskj and *REPO*=css_fall2023. 

When we go to the GitHub repository (https://github.com/mraskj/css_fall2023), we can see that our files are located in *data/ft-speeches*. Hence, we define the base URL as:

In [None]:
# Specify base url
base_url = 'https://raw.githubusercontent.com/mraskj/css_fall2023/master/data/ft-speeches/'
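The same URL can also be assembled from its parts, which makes it easier to point the notebook at another repo or folder. The sketch below is only an illustration: `build_raw_url` is a hypothetical helper that is not part of the exercise, and it assumes the files live on the `master` branch unless told otherwise.

```python
def build_raw_url(username, repo, path, branch="master"):
    """Return the raw.githubusercontent.com URL for a folder in a repo.

    Hypothetical helper for illustration; the branch defaults to 'master'.
    """
    # Strip stray slashes from the path so the URL ends with exactly one '/'
    clean_path = path.strip("/")
    return f"https://raw.githubusercontent.com/{username}/{repo}/{branch}/{clean_path}/"


# Rebuild the base URL used in this tutorial from its components
rebuilt_url = build_raw_url("mraskj", "css_fall2023", "data/ft-speeches")
print(rebuilt_url)
```

The result should match the `base_url` string defined above, including the trailing slash.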

We are now ready to read in the data in a similar way as the solution notebook suggests. Note that the file extension `.csv` is appended to the string. Without it, the URL cannot be found and you will get an error:

`HTTPError: HTTP Error 404: Not Found`

In [None]:
# Read in data: collect one DataFrame per file, then concatenate once
dfs = []
for file in tqdm(files):
    dfs.append(pd.read_csv(base_url + file + '.csv'))
df = pd.concat(dfs, ignore_index=True)

In [None]:
# Illustration of HTTPError: HTTP Error 404: Not Found
pd.read_csv(base_url + files[0])
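If a missing file should not stop the whole notebook, the error can be caught instead of letting it propagate. The sketch below shows one possible approach and is not part of the original exercise: `read_or_none` is a hypothetical helper, and it assumes that the reader raises `urllib.error.HTTPError`, which is the error type shown in the message above.

```python
from urllib.error import HTTPError


def read_or_none(url, reader):
    """Call reader(url) and return None if the server answers 404.

    Hypothetical helper for illustration. `reader` is any callable that
    takes a URL, for example `pd.read_csv`.
    """
    try:
        return reader(url)
    except HTTPError as err:
        if err.code == 404:
            # The file does not exist at this URL, so report it and move on
            print(f"Skipping {url}: {err}")
            return None
        raise  # re-raise anything that is not a 404
```

With pandas, this would be called as `read_or_none(base_url + files[0], pd.read_csv)`, which should print a message and return `None` because the `.csv` extension is missing.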