In [0]:
from pyspark import SparkContext, SparkConf
cf = SparkConf()
cf.set("spark.submit.deployMode","client")
sc = SparkContext.getOrCreate(cf)
from pyspark.sql import SparkSession
spark = SparkSession \
	    .builder \
	    .appName("Python Spark SQL basic example") \
	    .config("spark.some.config.option", "some-value") \
	    .getOrCreate()
                      

## Data Cleaning: Legally_Operating_Businesses_copy.csv

Start by reading 'Legally_Operating_Businesses_copy.csv' into a pandas dataframe. Because we will be looking at years and months, we convert the license creation date column to a datetime datatype. We also convert the license status, industry, address state, and address city columns to strings so that every entry in each of these columns has a uniform type. We keep only these five columns because they are the only ones we are interested in, and we rename them from words separated by spaces to words separated by underscores. For example, 'Address State' becomes 'Address_State'. All of this makes it easier to convert the dataframe into a pyspark dataframe and to run SQL queries on it.

In [3]:
import numpy as np
import pandas as pd
#read dataset in pandas
#treat the string "not available" as a missing value (NaN) while parsing, so placeholder text is not mixed in with real values
business_license = pd.read_csv('Legally_Operating_Businesses_copy.csv', na_values = "not available")

#set respective columns to its designated datatype
business_license['License Creation Date'] = pd.to_datetime(business_license['License Creation Date']) 
business_license = business_license.astype({'License Status':'string','Industry':'string', 'Address State':'string', 'Address City':'string' })

#keep only the columns we are interested in
business_license = business_license.loc[:, ["License Creation Date", "License Status", "Industry", "Address State", "Address City"]]
#rename each column name from separate words to words separated by underscores
business_license.rename(columns={"License Creation Date": "License_Creation_Date", "License Status": "License_Status", "Address State":"Address_State", "Address City": "Address_City"}, inplace=True)

business_license.dtypes
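The rename mapping above is typed out by hand. As a minimal sketch, it could also be derived from the column names themselves. The helper `to_underscore_label` is hypothetical (it is not part of pandas), and the column list simply repeats the five names used in this notebook.

```python
# Hypothetical helper: replace the spaces in a column name with underscores.
def to_underscore_label(name: str) -> str:
    return name.strip().replace(" ", "_")

# The five columns kept in this notebook.
columns = ["License Creation Date", "License Status", "Industry",
           "Address State", "Address City"]

# Build the old-name -> new-name mapping in one pass.
rename_map = {name: to_underscore_label(name) for name in columns}
print(rename_map["Address State"])  # Address_State
```

The resulting dictionary could be passed to `business_license.rename(columns=rename_map)` in place of the literal mapping.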

 
Read the business_license pandas dataframe into a spark dataframe. This way we can use Pyspark SQL queries to work further on the data.

In [5]:
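from pyspark.sql.types import StructType, StructField, DateType, StringType
#create schema for your dataframe
#note: the fifth field holds the values of the pandas column 'Address_City';
#it is labeled Address_Borough here, assuming createDataFrame applies the schema names by position
schema = StructType([StructField("License_Creation_Date", DateType(), True),
                     StructField("License_Status", StringType(), True),
                     StructField("Industry", StringType(), True),
                     StructField("Address_State", StringType(), True),
                     StructField("Address_Borough", StringType(), True)])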

#create spark dataframe using schema
business_license_df = spark.createDataFrame(business_license,schema=schema)
business_license_df.show()

As we can see, not all of the licenses are from New York, so we remove the entries for businesses outside NY. We are also focused only on New York City, so we keep only the entries whose borough is one of the five boroughs (Queens, Manhattan/New York, Brooklyn, Bronx, Staten Island). Some entries list a city or neighborhood within a borough instead of the borough itself. We do not include them because matching every such name to the correct borough would be impractical.

In [7]:
#remove non NY entries
business_license_df = business_license_df.filter( business_license_df["Address_State"] == "NY")
business_license_df.show()

In [8]:
#remove non NYC entries
#In the dataset NEW YORK is the same as MANHATTAN, so I included both in my list of boroughs
boroughs = ["QUEENS", "MANHATTAN", "NEW YORK", "STATEN ISLAND", "BROOKLYN", "BRONX"]
business_license_df = business_license_df.filter(business_license_df["Address_Borough"].isin(boroughs))
business_license_df.show()
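Because the filter keeps both "NEW YORK" and "MANHATTAN", any later grouping by borough will report them as two separate groups. The following is a minimal sketch of one way to fold them together, assuming "NEW YORK" should always count as "MANHATTAN". The names `BOROUGH_ALIASES`, `BOROUGHS`, and `normalize_borough` are hypothetical and the sample values are made up; the notebook itself keeps both labels.

```python
# Assumption: "NEW YORK" is an alias for "MANHATTAN" in this dataset.
BOROUGH_ALIASES = {"NEW YORK": "MANHATTAN"}
BOROUGHS = {"QUEENS", "MANHATTAN", "STATEN ISLAND", "BROOKLYN", "BRONX"}

def normalize_borough(raw):
    """Return the canonical borough name, or None if the value is not a borough."""
    if raw is None:
        return None
    label = raw.strip().upper()
    label = BOROUGH_ALIASES.get(label, label)
    return label if label in BOROUGHS else None

# Made-up sample values, for illustration only.
samples = ["NEW YORK", "Brooklyn ", "ASTORIA", None]
print([normalize_borough(s) for s in samples])
# ['MANHATTAN', 'BROOKLYN', None, None]
```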

Now we want to look at the number of licenses created each year. We use a pyspark SQL query to do this. The query returns a new pyspark dataframe, which we then convert into a pandas dataframe so that we can export the table to a csv file.

In [10]:
from pyspark.sql.functions import year

#Use SQL to group the data by years and count number of licenses made

business_license_df.createOrReplaceTempView("license")
business_license_by_year_df = spark.sql("SELECT YEAR(License_Creation_Date) AS Year, COUNT(*) AS Opened_Licenses FROM license GROUP BY YEAR(License_Creation_Date) ORDER BY year")

#turn dataframe into pandas df to export as csv
business_license_by_year = business_license_by_year_df.toPandas()
#rename column back into words separated by spaces
business_license_by_year.rename(columns={'Opened_Licenses': 'Opened Licenses'}, inplace=True)
display(business_license_by_year[business_license_by_year['Year'] > 2010])
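To make the query concrete, here is a plain-Python sketch of what `GROUP BY YEAR(...)` with `COUNT(*)` computes. The dates are made up for illustration and the variable names are hypothetical.

```python
from collections import Counter
from datetime import date

# Made-up license creation dates, for illustration only.
creation_dates = [
    date(2019, 3, 1), date(2019, 7, 15),
    date(2020, 1, 2),
    date(2021, 5, 9), date(2021, 6, 30),
]

# Count the licenses per year, then order by year, as the SQL query does.
licenses_per_year = sorted(Counter(d.year for d in creation_dates).items())
print(licenses_per_year)  # [(2019, 2), (2020, 1), (2021, 2)]
```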


In [11]:
#export CSV file
business_license_by_year.to_csv('business_license_by_year_updated.csv')

We do the same thing here, except that we now group by both year and borough.

In [13]:
#Use SQL to group the data by year and borough and count number of licenses made
business_license_by_borough_df = spark.sql("SELECT YEAR(License_Creation_Date) AS Year, Address_Borough AS Borough, COUNT(*) AS Opened_Licenses FROM license GROUP BY YEAR(License_Creation_Date), Address_Borough ORDER BY year")
#convert to pandas
business_license_by_borough = business_license_by_borough_df.toPandas()
#rename column
business_license_by_borough.rename(columns={'Opened_Licenses': 'Opened Licenses'}, inplace=True)
display(business_license_by_borough[business_license_by_borough['Year'] > 2018])

In [14]:
#export CSV
business_license_by_borough.to_csv('business_license_by_borough_updated.csv')

Once again we repeat the same steps as above, except that we now group by year and industry.

In [16]:
#Use SQL to group by year and industry
business_license_by_ind_df = spark.sql("SELECT YEAR(License_Creation_Date) AS Year, Industry, COUNT(*) AS Opened_Licenses FROM license GROUP BY YEAR(License_Creation_Date), Industry ORDER BY year DESC")
#convert to pandas
business_license_by_ind = business_license_by_ind_df.toPandas()
#rename
business_license_by_ind.rename(columns={'Opened_Licenses': 'Opened Licenses'}, inplace=True)
display(business_license_by_ind[business_license_by_ind['Year'] > 2019])

In [17]:
#export
business_license_by_ind.to_csv('business_license_by_industry_updated.csv')

## Data Cleaning: cases-by-day.csv

We start by reading the 'cases-by-day.csv' file into a pandas dataframe. We do this because we want to convert the 'date_of_interest' column to a datetime datatype, so that later, when we run SQL queries on the spark dataframe, we can group the data by year and month. We then turn the pandas dataframe into a spark dataframe.

In [20]:
#read csv as pandas
covid_numbers = pd.read_csv("cases-by-day.csv")
#turn column into datetime datatype
covid_numbers['date_of_interest'] = pd.to_datetime(covid_numbers['date_of_interest'])

#convert to a spark dataframe and create a tempview for sql queries
covid_df = spark.createDataFrame(covid_numbers)
covid_df.createOrReplaceTempView("covid")
covid_df.show()

The covid dataset is recorded per day. However, what we want is an average number of cases per month. To get this, we run a SQL query that groups the data by year and by month. For each group we take the average of the daily 7-day-average case count (ALL_CASE_COUNT_7DAY_AVG) for that month.

In [22]:
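#no pyspark.sql.functions imports are needed: YEAR, MONTH and AVG are evaluated inside the SQL string
#(importing max from pyspark.sql.functions would also shadow Python's built-in max)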
covid_df_update = spark.sql("SELECT YEAR(date_of_interest) AS year, MONTH(date_of_interest) AS month, AVG(ALL_CASE_COUNT_7DAY_AVG) AS cases FROM covid GROUP BY YEAR(date_of_interest), MONTH(date_of_interest) ORDER BY year, month")
covid_df_update.show()
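The monthly aggregation can be sketched in plain Python as follows. The daily values are made up, and `daily` and `monthly_avg` are hypothetical names. This is only meant to show what grouping by (year, month) and averaging does, not to replace the Spark query.

```python
from collections import defaultdict
from datetime import date

# Made-up (date, 7-day-average case count) pairs, for illustration only.
daily = [
    (date(2020, 3, 1), 10.0), (date(2020, 3, 2), 20.0),
    (date(2020, 4, 1), 30.0), (date(2020, 4, 2), 50.0),
]

# Group the daily values by (year, month).
groups = defaultdict(list)
for day, value in daily:
    groups[(day.year, day.month)].append(value)

# Average each group, ordered by year and then month.
monthly_avg = {key: sum(vals) / len(vals) for key, vals in sorted(groups.items())}
print(monthly_avg)  # {(2020, 3): 15.0, (2020, 4): 40.0}
```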


Another issue is that the year and month are now in separate columns, so the next step is to combine them. We start by converting the pyspark dataframe into a pandas dataframe. Next we join the year and month, separated by a period, and store the result as a new column in the dataframe. For example, year 2020 and month 1 become "2020.1". Finally, we select only the cases column and the newly created year_month column. We combine the month and year columns so that both are included when we plot the data.

In [24]:
#convert to pandas
covid_numbers = covid_df_update.toPandas()
#combine year and month columns as a new column 'year_month'
covid_numbers['year_month'] = covid_numbers['year'].astype(str) + '.' + covid_numbers['month'].astype(str)
#take only the cases column and 'year_month' column
covid_numbers = covid_numbers[['cases', 'year_month']]
covid_numbers
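One caveat with labels such as "2020.1" and "2020.10" is that they are strings, so sorting them alphabetically puts October before February. If they are ever parsed as numbers, October and January also collapse to the same value. A minimal sketch of a sort key that keeps the months in calendar order follows. `year_month_key` is a hypothetical helper and the labels are made up.

```python
# Made-up year_month labels in the notebook's "year.month" format.
labels = ["2020.10", "2020.2", "2021.1", "2020.1"]

def year_month_key(label: str):
    """Split "2020.10" into the integer pair (2020, 10) for sorting."""
    year, month = label.split(".")
    return int(year), int(month)

print(sorted(labels))                      # ['2020.1', '2020.10', '2020.2', '2021.1']
print(sorted(labels, key=year_month_key))  # ['2020.1', '2020.2', '2020.10', '2021.1']

# Parsed as floats, October and January become indistinguishable.
print(float("2020.10") == float("2020.1"))  # True
```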

In [25]:
#export as csv
covid_numbers.to_csv('covid_numbers_updated.csv')

## Data Cleaning: savings.csv

We want to compare the covid case data to the amount of personal savings. From the previous step, 'covid_numbers_updated.csv' labels the months as '2020.1', '2020.2', etc. The savings.csv data is not labeled this way: the years are listed in the header row and the months in the row below it. To compare the two datasets, we format the dates of the savings data to match the updated covid numbers data. We start by reading the data into a pandas dataframe.

In [28]:
import decimal
income_disposition = pd.read_csv("savings.csv")
income_disposition.head()

As we can see, the first row of the table lists the months: "JAN", "FEB", etc. We also notice that the year labels in the header already carry a numeric suffix that indicates the month, so we do not really need the row with the months and can delete it. We also don't need the first column, 'Line', which holds the line count. There is also a row with empty values. We could either delete it or fill it with 0's, and in this case we fill it with 0's. Finally, to make sure the data has the correct datatype, we convert every column except the first to float.

In [30]:
#drop row with months and drop first column with row count
income_disposition = income_disposition.drop(['Line'], axis = 1)
income_disposition= income_disposition.iloc[1:, :]

#fill empty rows with 0
income_disposition = income_disposition.fillna(0)

#change column datatypes to floats
cols = income_disposition.columns
cols = cols[1:]
for col in cols:
    income_disposition[col] = income_disposition[col].astype("float")

income_disposition.head()
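One plausible explanation for the numeric suffixes on the year labels, which the source file does not confirm, is that the header of savings.csv repeats the year once per month. `pd.read_csv` makes duplicate column names unique by appending ".1", ".2", and so on. A minimal sketch with a made-up three-column CSV:

```python
import io
import pandas as pd

# Made-up CSV whose header repeats the same year for three consecutive months.
raw = "2020,2020,2020\n1.0,2.0,3.0\n"

demo = pd.read_csv(io.StringIO(raw))
# pandas de-duplicates the repeated header names by appending .1, .2, ...
print(list(demo.columns))  # ['2020', '2020.1', '2020.2']
```

Under that assumption the first month gets no suffix at all and the second month gets ".1".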

Another thing we notice is that the year labels do not quite match the year_month labels of our covid data. In our covid data, the year_month is "2020.1" for January 2020, "2020.2" for February 2020, etc. In the personal savings data, however, January 2020 is labeled just "2020" while February 2020 is "2020.1". In other words, the labels in our savings data are off by one month, so the dates will not line up when we compare the savings data with the covid numbers data. The last thing we do is fix this by iterating through each year label in the savings data and shifting it to match the year_month labels of our covid numbers.

In [32]:
#rename columns
#create a list of the new column names
new_cols = []
for i in range(len(income_disposition.columns)):
    #the first column holds the row labels
    if i == 0:
        new_cols.append("type")
        continue

    name = decimal.Decimal(income_disposition.columns[i])
    if income_disposition.columns[i][-2] == '.':
        #single-digit suffix: "2020.9" becomes "2020.10", otherwise add one to the suffix
        if income_disposition.columns[i][-1] == '9':
            new_name = income_disposition.columns[i].replace('.9', '.10')
        else:
            new_name = name + decimal.Decimal('0.1')
        new_cols.append(str(new_name))
    elif income_disposition.columns[i][-3] == '.':
        #two-digit suffix: "2020.10" becomes "2020.11"
        new_name = name + decimal.Decimal('0.01')
        new_cols.append(str(new_name))
    else:
        #no suffix: "2020" becomes "2020.1"
        new_name = name + decimal.Decimal('0.1')
        new_cols.append(str(new_name))

#replace the column names with new column names
income_disposition.columns = new_cols
income_disposition.head()
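The same shift can be expressed without decimal arithmetic by treating the part after the period as an integer month offset. This is a minimal alternative sketch, not the method used above. `shift_month_label` is a hypothetical helper, and it assumes every label is either a bare year or a year followed by a period and an integer.

```python
def shift_month_label(label: str) -> str:
    """Shift a savings label by one month: "2020" -> "2020.1", "2020.9" -> "2020.10"."""
    year, _, suffix = label.partition(".")
    # A bare year is the first month; otherwise add one to the existing suffix.
    month = int(suffix) + 1 if suffix else 1
    return f"{year}.{month}"

# Made-up labels covering each case handled by the loop above.
old_labels = ["2020", "2020.1", "2020.9", "2020.10", "2020.11"]
print([shift_month_label(label) for label in old_labels])
# ['2020.1', '2020.2', '2020.10', '2020.11', '2020.12']
```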

In [33]:
#export as csv
income_disposition.to_csv('savings_updated.csv')

## Data Cleaning: us_small_bus.csv

This is a simple dataset about the number of US small businesses over the years. It is small and does not require much cleaning, but a few small things need to be fixed. We start by reading the dataset into a pandas dataframe.

In [36]:
US_business_df = pd.read_csv("us_small_bus.csv")
US_business_df.head()

As we can see, the first column 'Line' is unnecessary, so we can drop it. There are also rows that are all 0's. These are unnecessary as well, so we drop them too. The second column is named 'Unnamed: 1', which is not very informative, so we rename it to 'industry'. Throughout the dataset, there are some strangely named entries such as "Farms2". Because the dataset is small, we can change each of these entries individually. One last thing that is easy to overlook is that each entry in the 'Unnamed: 1' column (renamed to 'industry') has leading white space. This would cause unnecessary issues later, so we remove the leading white space from each entry.

In [38]:
#drop rows that are all 0's
US_business_df = US_business_df.drop([0,6])

#drop column 'Line' because it is not necessary
US_business_df = US_business_df.drop(columns=['Line'])

#rename the entries of three different cells
US_business_df.iloc[0,0] = "Self-employed persons"
US_business_df.iloc[2,0] = "Farms"
US_business_df.iloc[14,0] = "Professional and business services"

#rename first column
US_business_df.rename(columns={'Unnamed: 1': 'industry'}, inplace=True)

#clean up the leading (and any trailing) white space in the industry column
US_business_df['industry'] = US_business_df['industry'].str.strip()

US_business_df.head()
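For a larger table, fixing entries such as "Farms2" one cell at a time would not scale. The following is a minimal sketch of a rule-based cleanup, assuming the trailing digits are leftover footnote markers (the source does not say what they are). `clean_industry` is a hypothetical helper and the sample strings other than "Farms2" are made up.

```python
import re

def clean_industry(name: str) -> str:
    """Strip surrounding white space and a trailing run of digits."""
    # Assumption: a trailing digit run is a footnote marker, as in "Farms2".
    return re.sub(r"\d+$", "", name.strip()).strip()

# Made-up sample entries with leading white space and trailing digits.
samples = ["   Farms2", "  Self-employed persons1", " Retail trade"]
print([clean_industry(s) for s in samples])
# ['Farms', 'Self-employed persons', 'Retail trade']
```

This rule would also remove digits that legitimately end a name, so it is only safe where no industry label ends in a number.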

In [39]:
#export as csv
US_business_df.to_csv('us_small_bus_updated.csv')