# Spark Hands On Lab - Python

#### Let's first download the data that we're going to analyze from GitHub. For the purposes of this lab, you'll just land the data on temporary storage. In practice, the data would reside on object storage or in a database.

#### The data file contains vehicle accident data and has a size of about 6.5 MB.
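
The download code itself is not shown in this lab text. A minimal sketch of that step, assuming the file is reachable over HTTP(S), might look like the following. The `download_file` helper and the example.com address are placeholders for illustration only, not the lab's actual GitHub location.

```python
import os
import urllib.request


def download_file(url, dest_path):
    """Download url to dest_path, skipping the download if the file already exists."""
    if not os.path.exists(dest_path):
        urllib.request.urlretrieve(url, dest_path)
    return dest_path


# Hypothetical usage: replace the placeholder URL with the real location of the data file.
# download_file("https://example.com/ACCIDENT2007-FullDataSet.csv",
#               "ACCIDENT2007-FullDataSet.csv")
```

Skipping the download when the file is already present keeps the cell cheap to rerun.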

#### Next we'll read the file into a Spark DataFrame. We specify that the file is in CSV format and that the first row of the file contains column names. We also ask Spark to infer the schema and assign data types. We'll then print out the first two rows of the DataFrame.

#### Here's how you do it.

    df_data_1 = spark.read.format('csv').options(header='true', inferSchema='true').load("ACCIDENT2007-FullDataSet.csv")
    df_data_1.take(2)


#### Let's look at the inferred schema of the resulting DataFrame.

    df_data_1.printSchema()
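
Schema inference decides whether a column arrives as an integer, a float, or a string, which matters further down when the coordinate columns turn out to be integers. The toy function below only illustrates the general idea (try the narrowest type first, then widen). `infer_type` is a hypothetical helper and is not how Spark implements schema inference.

```python
def infer_type(values):
    """Return 'int', 'float', or 'string' for a column of raw CSV strings."""

    def all_parse(cast):
        # True only if every value in the column can be converted by `cast`.
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    return "string"


# Made-up sample columns, not taken from the accident file.
print(infer_type(["34052200", "37774900"]))  # prints: int
print(infer_type(["34.0522", "37.7749"]))    # prints: float
print(infer_type(["CA", "6"]))               # prints: string
```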

#### Now that you have a DataFrame object, you can do analysis on it such as calculating the correlation between various variables.

#### You can see the correlation between the number of drunk drivers involved (DRUNK_DR) and the number of fatalities (FATALS) by computing a simple Pearson correlation on the two columns.

    df_data_1.corr('DRUNK_DR','FATALS')
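
To make the number returned by `corr` easier to interpret, here is a minimal sketch of the Pearson formula in plain Python. The `pearson` function and the sample lists are illustrative assumptions (the values are made up, not taken from the accident file). A result near 1 or -1 means a strong linear relationship, and a result near 0 means little or none.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values standing in for the DRUNK_DR and FATALS columns.
drunk_dr = [0, 1, 0, 2, 1]
fatals = [1, 1, 1, 2, 1]
print(pearson(drunk_dr, fatals))
```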

#### You can also import your favorite libraries to use within your notebook, for example, the popular pandas Python library.

#### Import the pandas library, use it to convert the Spark DataFrame to a pandas DataFrame, and use the head function to show the first 5 rows.

    import pandas as pd
    pd_fars = df_data_1.toPandas()
    pd_fars.head()

#### You can also get summary statistics using the describe function. This can help you spot missing values and understand the distribution of your attributes.

    pd_fars.describe()

#### Now we want to look at a single state's worth of data. The Spark DataFrame object supports a filter operation that lets you keep only the rows where a column of interest (e.g. STATE) has a given value. The Fatality Analysis Reporting System (FARS) Analytical User's Manual provides a reference to the state codes (https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/812092). In this example, we will look at the state of California, whose state code is 6.

#### Next, convert the results to a new pandas DataFrame and view the first 5 rows of data.

#### The code should look like this.

    df_cal = df_data_1.filter(df_data_1['STATE'] == 6)
    pd_cal = df_cal.toPandas()
    pd_cal.head(5)


#### Spark SQL enables applications to run SQL queries programmatically and return the result as a DataFrame. Let's perform the same query we did above, but this time using SQL.

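#### First you need to register the DataFrame as a temporary table (called a temporary view in Spark 2.x)

    # registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it.
    # The call returns None, so there is nothing to assign.
    df_data_1.createOrReplaceTempView("tempTable")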

#### Now run a similar query as above using SQL, but this time select only a subset of the columns. Note that the query returns the same rows as above, but only the columns that were selected.

    sql = "select STATE, COUNTY, MONTH, DAY, HOUR, MINUTE, VE_TOTAL, PERSONS, PEDS, NHS from tempTable where STATE = 6"
    df_cal2 = spark.sql(sql)
    pd_cal2=df_cal2.toPandas()
    pd_cal2.head(5)
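
The relationship between the filter call and the SQL query can be checked on a tiny scale without a Spark cluster. The sketch below swaps in Python's built-in `sqlite3` module in place of Spark SQL, with a few made-up rows and a reduced set of columns, purely to show that a `WHERE STATE = 6` query and an equivalent row filter select the same records. It is an illustration, not part of the lab's Spark code.

```python
import sqlite3

# Made-up rows in the order (STATE, COUNTY, MONTH, PERSONS).
rows = [(6, 37, 1, 2), (6, 73, 3, 1), (48, 201, 7, 4)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tempTable (STATE INTEGER, COUNTY INTEGER, MONTH INTEGER, PERSONS INTEGER)"
)
conn.executemany("INSERT INTO tempTable VALUES (?, ?, ?, ?)", rows)

# SQL version: pick a subset of columns for one state.
sql_result = conn.execute(
    "SELECT STATE, COUNTY, PERSONS FROM tempTable WHERE STATE = 6"
).fetchall()
conn.close()

# Filter version: same condition, same subset of columns.
filtered = [(state, county, persons) for state, county, month, persons in rows if state == 6]

# Sort both sides because SQL gives no ordering guarantee without ORDER BY.
print(sorted(sql_result) == sorted(filtered))  # prints: True
```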

#### Let's now map out the occurrences for the state of interest (CA). Note that the LATITUDE and LONGITUD values were inferred as integers. These need to be converted to floats. We can create new columns holding a modified version of these fields so that we can plot the individual occurrences on a map.

#### Use lambda (anonymous) functions to create lon and lat columns that hold the expected values.

    pd_cal['lat'] = pd_cal['LATITUDE'].map(lambda x: (x * 1.0) / 1000000)
    pd_cal['lon'] = pd_cal['LONGITUD'].map(lambda x: (x * -1.0) / 1000000)
    pd_cal.head(5)
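
The same conversion can be written as a small named function, which makes it easy to guard against raw values that do not correspond to a real coordinate. This is a sketch under assumptions: `to_decimal_degrees` is a hypothetical helper, the sample inputs are made up, and whether this data file actually contains out-of-range values is not something the lab states.

```python
def to_decimal_degrees(raw, negate=False, limit=90.0):
    """Scale an integer coordinate by 1,000,000, as the lambdas above do.

    Returns None when the result falls outside [-limit, limit].
    """
    degrees = raw / 1000000.0
    if negate:
        degrees = -degrees
    if abs(degrees) > limit:
        return None
    return degrees


# Made-up sample values, not taken from the accident file.
lat = to_decimal_degrees(34052200)
lon = to_decimal_degrees(118243700, negate=True, limit=180.0)
print(lat, lon)
```

A function like this can be passed to `map` in place of the lambdas, for example `pd_cal['LATITUDE'].map(to_decimal_degrees)`.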

#### Now that we have the data we need, we can import the Brunel package and use it to show a graphical map of the occurrences. The Brunel Visualization Language is a high-level language developed and open-sourced by IBM. Brunel describes visualizations in terms of composable actions, and drives a visualization engine (D3) that performs the actual rendering and interactivity.

#### Use the following code to display the map for your state using the lon and lat values and use PERSONS to display a color scale based on the number of individuals involved in the accident.

    import brunel
    %brunel map ('CA') + data('pd_cal') x(lon) y(lat) color(PERSONS) tooltip(VE_TOTAL)

#### PixieDust is an open source Python helper library developed by IBM that works as an add-on to Jupyter notebooks to improve the user experience of working with data. PixieDust provides an easy way to visualize the data using various tables, charts, and maps.

To import the PixieDust package, you simply need to use an import statement. Once imported, you bring up the interactive display area by using the display function on your dataset.
    
    from pixiedust.display import *
    display(df_data_1)

The initial display is a table view of the dataframe.  

Switch views to the pie chart by selecting the middle charting drop-down menu at the top left of the display area. This will display a pie chart of the count of accidents by state along with a percentage. You can view and modify the options used for the display by selecting the paintbrush icon at the bottom left of the display area (note that this may be invisible until you hover near the area with your mouse). If you change the value to city instead of state, you will see a busier graph.

You can switch to a different DataFrame at any time by changing the value in the display parameter and rerunning the cell.

Change the display to use the California data: display(df_cal).

Select the histogram chart from the drop-down and then modify the values to contain city.

#### At this point we are ready to start our first pass at building a predictive model. In this example we will use the statsmodels.formula.api package.

#### Use the following code to build a basic linear regression.

    import statsmodels.formula.api as sm

    result = sm.ols(formula="FATALS ~ VE_TOTAL + PERSONS + WEATHER + VE_FORMS", data=pd_cal).fit()
    print(result.params)

To see a summary of all the results, use

    print(result.summary())

As you can see from the results, this is not a good model at all.  Select a different set of attributes to see if you can improve the R-squared value.
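
To make the R-squared figure in the summary more concrete, here is a minimal sketch of how it is computed for a single-predictor least-squares line, in plain Python. The `fit_line` and `r_squared` helpers and the sample values are illustrative assumptions. The lab's statsmodels model uses several predictors, but the definition of R-squared (one minus residual variation over total variation) is the same.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variation in ys that the fitted line explains."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up values standing in for the PERSONS and FATALS columns.
persons = [1, 2, 3, 4, 5]
fatals = [1, 1, 2, 1, 2]
intercept, slope = fit_line(persons, fatals)
print(r_squared(persons, fatals, intercept, slope))
```

A value close to 1 means the predictors explain most of the variation in FATALS, and a value close to 0 means they explain very little.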