### Analyzing the relationship between COVID-19 and PM 2.5 levels

Preprocessing the PM 2.5 datasets for cities in China

UCLA DataFest 2020

Team R++: Kaixin Wang (kaixinwang@ucla.edu), Jingyi Lu (rebeccalu99@ucla.edu)

In this IPython notebook, we access the PM 2.5 data for cities in China published at https://quotsoft.net/air/ and reduce it to daily averages for Beijing, Shanghai and Wuhan.

The notebook produces two output `csv` files:

- `Cities_China_PM2.5_2019.csv`: daily average level of PM 2.5 in Beijing, Shanghai and Wuhan (01/01/2019 - 12/31/2019)
- `Cities_China_PM2.5_2020.csv`: daily average level of PM 2.5 in Beijing, Shanghai and Wuhan (01/01/2020 - 05/01/2020)

In [1]:
import numpy as np
import pandas as pd
import requests
from datetime import datetime

#### 1. Extract information for dates between 2019/01/01 and 2019/12/31

Create a range of dates

In [2]:
# Build one "YYYYMMDD" string per day, matching the file names on the website
dates_2019 = pd.date_range(start="20190101", end="20191231").strftime("%Y%m%d").tolist()

Read in the file for each date, and store the daily average PM 2.5 level of each city as one row. Dates for which no file is available are skipped.

In [3]:
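import io

rows_2019 = []  # one row of city averages per available day
days_2019 = []
for day in dates_2019:
    # Download each file once; days without a file (non-200 status) are skipped
    response = requests.get("https://quotsoft.net/air/data/china_cities_%s.csv" % day)
    if response.status_code == 200:
        df = pd.read_csv(io.BytesIO(response.content))
        # Keep the PM2.5 rows for the three cities and average over the day
        df = df.loc[df["type"] == "PM2.5", ["北京", "上海", "武汉"]]
        rows_2019.append(df.mean(axis=0))
        days_2019.append(day)
# DataFrame.append was removed in pandas 2.0, so build the frame in one step
cities_2019 = pd.DataFrame(rows_2019).reset_index(drop=True)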
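
To make the filtering and averaging step concrete, here is a minimal sketch on a small made-up table. It assumes each daily file has one row per hour and pollutant type, with one column per city. The `hour` column and every value below are illustrative assumptions, not real measurements.

```python
import io
import pandas as pd

# Made-up sample imitating the assumed layout of one daily file
sample_csv = """hour,type,北京,上海,武汉
0,PM2.5,10,20,30
0,PM10,50,60,70
1,PM2.5,30,40,50
1,PM10,70,80,90
"""

sample = pd.read_csv(io.StringIO(sample_csv))

# Keep only the PM2.5 rows and the three city columns
pm25 = sample.loc[sample["type"] == "PM2.5", ["北京", "上海", "武汉"]]

# Average over the hourly rows: one value per city
daily_mean = pm25.mean(axis=0)
print(daily_mean)
```

The result is a Series indexed by city name, which is why each day contributes exactly one row to the final dataframe.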

Reformat the dates

In [4]:
days_2019 = [datetime.strptime(day, "%Y%m%d").strftime("%m/%d/%Y") for day in days_2019]
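
As a quick illustration of this conversion, the sketch below turns a single `YYYYMMDD` string into `MM/DD/YYYY`. The helper name `to_us_date` is hypothetical and used only for this example; `%Y` is the standard directive for a four-digit year.

```python
from datetime import datetime


def to_us_date(day):
    """Convert a 'YYYYMMDD' string to 'MM/DD/YYYY' (hypothetical helper)."""
    return datetime.strptime(day, "%Y%m%d").strftime("%m/%d/%Y")


print(to_us_date("20190105"))  # prints 01/05/2019
```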

Rename the city columns to their English names, and insert the dates as a column

In [5]:
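# Map by name rather than by position so the labels cannot be mismatched
cities_2019 = cities_2019.rename(columns={"北京": "Beijing", "上海": "Shanghai", "武汉": "Wuhan"})
cities_2019 = cities_2019[["Beijing", "Shanghai", "Wuhan"]]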
cities_2019["Dates"] = days_2019

Check that the dataframe looks as expected

In [6]:
cities_2019.head()

|   | Beijing | Shanghai | Wuhan | Dates |
|---|---|---|---|---|
| 0 | 27.166667 | 35.5 | 44.083333 | 01/01/2019 |
| 1 | 63.666667 | 47.666667 | 58.625 | 01/02/2019 |
| 2 | 48.666667 | 123.916667 | 70.125 | 01/03/2019 |
| 3 | 17.75 | 24.5 | 91.0 | 01/04/2019 |
| 4 | 31.958333 | 14.458333 | 158.916667 | 01/05/2019 |
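
Beyond looking at the first rows, a few automated checks can help confirm the dataframe is as desired. The sketch below is one possible set of checks, not part of the original workflow. It retypes the first three rows of the output shown above and uses a hypothetical four-day request range to show how skipped days could be listed.

```python
import pandas as pd

# First three rows of the output shown above, retyped for illustration
cities = pd.DataFrame({
    "Beijing": [27.166667, 63.666667, 48.666667],
    "Shanghai": [35.5, 47.666667, 123.916667],
    "Wuhan": [44.083333, 58.625, 70.125],
    "Dates": ["01/01/2019", "01/02/2019", "01/03/2019"],
})

# Check that no city column contains missing values
city_cols = ["Beijing", "Shanghai", "Wuhan"]
has_missing = bool(cities[city_cols].isna().any().any())

# Each date should appear only once
has_duplicate_dates = bool(cities["Dates"].duplicated().any())

# Days in the requested range that produced no row (hypothetical 4-day range)
requested = pd.date_range("2019-01-01", "2019-01-04").strftime("%m/%d/%Y")
skipped = sorted(set(requested) - set(cities["Dates"]))

print(has_missing, has_duplicate_dates, skipped)
```

For this small sample, the only skipped day is `01/04/2019`, because that row was deliberately left out of the illustration.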


Export the dataframe created above to a `csv` file

In [7]:
cities_2019.to_csv("Cities_China_PM2.5_2019.csv", index = False)
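
The effect of `index = False` can be seen in a small round trip. This is a minimal sketch with made-up values, writing to an in-memory buffer instead of a file so that nothing is created on disk.

```python
import io
import pandas as pd

# Made-up values, chosen only to illustrate the export format
frame = pd.DataFrame({
    "Beijing": [27.5, 63.25],
    "Shanghai": [35.5, 47.75],
    "Wuhan": [44.0, 58.625],
    "Dates": ["01/01/2019", "01/02/2019"],
})

# Write without the row index, then read the text back
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
header_line = buffer.getvalue().splitlines()[0]

buffer.seek(0)
restored = pd.read_csv(buffer)

print(header_line)
print(restored.columns.tolist())
```

Because the index is not written, the header contains only the three city columns and `Dates`, and reading the file back does not produce an extra unnamed index column.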

#### 2. Repeat the same procedure for dates between 2020/01/01 and 2020/05/01

In [8]:
# Build one "YYYYMMDD" string per day, matching the file names on the website
dates_2020 = pd.date_range(start="20200101", end="20200501").strftime("%Y%m%d").tolist()

In [9]:
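import io

rows_2020 = []  # one row of city averages per available day
days_2020 = []
for day in dates_2020:
    # Download each file once; days without a file (non-200 status) are skipped
    response = requests.get("https://quotsoft.net/air/data/china_cities_%s.csv" % day)
    if response.status_code == 200:
        df = pd.read_csv(io.BytesIO(response.content))
        # Keep the PM2.5 rows for the three cities and average over the day
        df = df.loc[df["type"] == "PM2.5", ["北京", "上海", "武汉"]]
        rows_2020.append(df.mean(axis=0))
        days_2020.append(day)
# DataFrame.append was removed in pandas 2.0, so build the frame in one step
cities_2020 = pd.DataFrame(rows_2020).reset_index(drop=True)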

In [10]:
days_2020 = [datetime.strptime(day, "%Y%m%d").strftime("%m/%d/%Y") for day in days_2020]

In [11]:
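# Map by name rather than by position so the labels cannot be mismatched
cities_2020 = cities_2020.rename(columns={"北京": "Beijing", "上海": "Shanghai", "武汉": "Wuhan"})
cities_2020 = cities_2020[["Beijing", "Shanghai", "Wuhan"]]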
cities_2020["Dates"] = days_2020

In [12]:
cities_2020.head()

|   | Beijing | Shanghai | Wuhan | Dates |
|---|---|---|---|---|
| 0 | 14.208333 | 33.708333 | 45.083333 | 01/01/2020 |
| 1 | 23.041667 | 50.875 | 78.0 | 01/02/2020 |
| 2 | 39.875 | 50.5 | 98.875 | 01/03/2020 |
| 3 | 62.833333 | 42.166667 | 108.875 | 01/04/2020 |
| 4 | 42.458333 | 63.708333 | 61.5 | 01/05/2020 |


In [13]:
cities_2020.to_csv("Cities_China_PM2.5_2020.csv", index = False)
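
Since both years follow the same steps, the procedure could be wrapped in a single function. The sketch below is one possible refactoring, not the code used to produce the files above. The names `daily_pm25`, `fetch_day`, `CITY_NAMES` and `URL_TEMPLATE` are hypothetical, and the `fetch` parameter is an assumption added so the logic can be exercised without network access.

```python
import io

import pandas as pd
import requests

# Hypothetical constants for this sketch
CITY_NAMES = {"北京": "Beijing", "上海": "Shanghai", "武汉": "Wuhan"}
URL_TEMPLATE = "https://quotsoft.net/air/data/china_cities_%s.csv"


def fetch_day(day):
    """Return the raw CSV bytes for one 'YYYYMMDD' day, or None if unavailable."""
    response = requests.get(URL_TEMPLATE % day)
    return response.content if response.status_code == 200 else None


def daily_pm25(start, end, fetch=fetch_day):
    """Daily average PM2.5 per city between start and end (inclusive)."""
    rows = []
    for date in pd.date_range(start=start, end=end):
        raw = fetch(date.strftime("%Y%m%d"))
        if raw is None:
            continue  # no file for this day
        df = pd.read_csv(io.BytesIO(raw))
        # Average the PM2.5 rows of the three cities, then use English names
        means = df.loc[df["type"] == "PM2.5", list(CITY_NAMES)].mean(axis=0)
        row = means.rename(CITY_NAMES).to_dict()
        row["Dates"] = date.strftime("%m/%d/%Y")
        rows.append(row)
    return pd.DataFrame(rows)


# Usage (requires network access):
# daily_pm25("20190101", "20191231").to_csv("Cities_China_PM2.5_2019.csv", index=False)
# daily_pm25("20200101", "20200501").to_csv("Cities_China_PM2.5_2020.csv", index=False)
```

Passing the download step in as a function keeps the averaging logic separate from the network call, so it can be checked against a small in-memory sample before running the full date range.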