# United States - Crime Rates - 1960 - 2014

### Introduction:

This time you will create a data 

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T
import pyspark.pandas as ps

spark = SparkSession.builder.appName("US_Crime_Rates").getOrCreate()



### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/US_Crime_Rates/US_Crime_Rates_1960_2014.csv). 

### Step 3. Assign it to a variable called crime.

In [24]:
from pyspark import SparkFiles

url = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/US_Crime_Rates/US_Crime_Rates_1960_2014.csv"
# Download the CSV so Spark can read it from a local path
spark.sparkContext.addFile(url)

# Use the first row as the header and let Spark infer the column types
crime = spark.read.csv(SparkFiles.get("US_Crime_Rates_1960_2014.csv"), header=True, inferSchema=True)

### Step 4. What is the type of the columns?

In [25]:
crime.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Population: integer (nullable = true)
 |-- Total: integer (nullable = true)
 |-- Violent: integer (nullable = true)
 |-- Property: integer (nullable = true)
 |-- Murder: integer (nullable = true)
 |-- Forcible_Rape: integer (nullable = true)
 |-- Robbery: integer (nullable = true)
 |-- Aggravated_assault: integer (nullable = true)
 |-- Burglary: integer (nullable = true)
 |-- Larceny_Theft: integer (nullable = true)
 |-- Vehicle_Theft: integer (nullable = true)



##### Have you noticed that the type of Year is integer? Spark has a dedicated date type that is better suited to time series. Let's see it now.

### Step 5. Convert the type of the column Year to datetime64

In [26]:
# A bare year is not a date, so pin each year to January 1st as a 'YYYY-01-01' string
to_fake_date = F.udf(lambda x: f'{x}-01-01')

In [27]:
crime = crime.withColumn('Year', F.to_date(to_fake_date(F.col('Year'))))
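
The conversion above happens in two steps per row: the UDF formats the year as a `YYYY-01-01` string, and `F.to_date` parses that string into a date. As a minimal sketch of the same per-row logic in plain Python (the helper name `year_to_date` is hypothetical and not part of the notebook):

```python
from datetime import date


def year_to_date(year: int) -> date:
    """Mirror the notebook's two steps: build 'YYYY-01-01', then parse it."""
    fake_date_string = f"{year}-01-01"  # what the UDF returns
    return date.fromisoformat(fake_date_string)  # what to_date does with it


first_year = year_to_date(1960)
print(first_year)
```

The choice of January 1st is arbitrary; only the year part of the resulting date carries information.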

In [28]:
# Year is now a date; the other columns are still integers
crime.printSchema()

root
 |-- Year: date (nullable = true)
 |-- Population: integer (nullable = true)
 |-- Total: integer (nullable = true)
 |-- Violent: integer (nullable = true)
 |-- Property: integer (nullable = true)
 |-- Murder: integer (nullable = true)
 |-- Forcible_Rape: integer (nullable = true)
 |-- Robbery: integer (nullable = true)
 |-- Aggravated_assault: integer (nullable = true)
 |-- Burglary: integer (nullable = true)
 |-- Larceny_Theft: integer (nullable = true)
 |-- Vehicle_Theft: integer (nullable = true)



### Step 6. Set the Year column as the index of the dataframe

In [21]:
# not applicable: Spark DataFrames have no index

### Step 7. Delete the Total column

In [29]:
crime = crime.drop('Total')

### Step 8. Group the year by decades and sum the values

#### Pay attention to the Population column: summing it is a mistake
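
The reason is that crime counts are events that accumulate from year to year, while population is a level measured once per year, so adding ten yearly populations counts the same residents ten times. A minimal plain-Python sketch of the grouping logic, using made-up toy rows rather than the real dataset (the function name `summarize_by_decade` is hypothetical):

```python
from collections import defaultdict

# Hypothetical toy rows (year, population, violent); the numbers are made up
# for illustration and are NOT taken from the dataset.
rows = [
    (1960, 100, 5),
    (1961, 110, 7),
    (1970, 120, 9),
    (1971, 130, 4),
]


def summarize_by_decade(rows):
    """Sum the crime counts per decade, but keep the largest population."""
    violent_sum = defaultdict(int)
    population_max = defaultdict(int)
    for year, population, violent in rows:
        decade = year // 10 * 10  # e.g. 1961 -> 1960
        violent_sum[decade] += violent
        population_max[decade] = max(population_max[decade], population)
    return {d: (population_max[d], violent_sum[d]) for d in sorted(violent_sum)}


summary = summarize_by_decade(rows)
print(summary)
```

Taking the maximum is one reasonable choice for Population; the mean or the last value of each decade would be defensible alternatives.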

In [78]:
# Integer-divide the year by 10 so every year of a decade shares one key (1960-1969 -> 196)
to_decade = F.udf(lambda x: x // 10, T.IntegerType())

In [79]:
crime = crime.withColumn('10Year', to_decade(F.year(F.col('Year'))))

In [99]:
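decade_crimes = crime.groupBy('10Year').agg(
    # The earliest year in each group labels the decade
    F.min(F.year(F.col('Year'))).alias('decade'),
    # Population is a yearly level, not a count of events, so take the max instead of the sum
    F.max(F.col('Population')).alias('Population'),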
    F.sum(F.col('Violent')).alias('Violent'),
    F.sum(F.col('Property')).alias('Property'),
    F.sum(F.col('Murder')).alias('Murder'),
    F.sum(F.col('Forcible_Rape')).alias('Forcible_Rape'),
    F.sum(F.col('Robbery')).alias('Robbery'),
    F.sum(F.col('Aggravated_assault')).alias('Aggravated_assault'),
    F.sum(F.col('Burglary')).alias('Burglary'),
    F.sum(F.col('Larceny_Theft')).alias('Larceny_Theft'),
    F.sum(F.col('Vehicle_Theft')).alias('Vehicle_Theft'),
).orderBy('decade').drop('10Year')


In [100]:
decade_crimes.show()

+------+----------+--------+---------+------+-------------+-------+------------------+--------+-------------+-------------+
|decade|Population| Violent| Property|Murder|Forcible_Rape|Robbery|Aggravated_assault|Burglary|Larceny_Theft|Vehicle_Theft|
+------+----------+--------+---------+------+-------------+-------+------------------+--------+-------------+-------------+
|  1960| 201385000| 4134930| 45160900|106180|       236720|1633510|           2158520|13321100|     26547700|      5292100|
|  1970| 220099000| 9607930| 91383800|192230|       554570|4159020|           4702120|28486000|     53157800|      9739900|
|  1980| 248239000|14074328|117048900|206439|       865639|5383109|           7619130|33073494|     72040253|     11935411|
|  1990| 272690813|17527048|119053499|211664|       998827|5748930|          10568963|26750015|     77679366|     14624418|
|  2000| 307006550|13968056|100944369|163068|       922499|4230366|           8652124|21565176|     67970291|     11412834|
|  2010|

### Step 9. What is the most dangerous decade to live in the US?

In [101]:
# not applicable
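
One possible way to approach this question, sketched in plain Python from the five complete rows of the `decade_crimes` output shown in Step 8. The 2010 row is cut off in that output and is left out here. Treating "most dangerous" as "most violent crime" is an assumption, as is the per-100,000 rate based on each decade's peak population.

```python
# (decade, population, violent) copied from the five complete rows of the
# decade_crimes output; the truncated 2010 row is omitted.
decades = [
    (1960, 201385000, 4134930),
    (1970, 220099000, 9607930),
    (1980, 248239000, 14074328),
    (1990, 272690813, 17527048),
    (2000, 307006550, 13968056),
]


def violent_per_100k(row):
    """Violent crimes per 100,000 residents (assumed measure of danger)."""
    _, population, violent = row
    return violent / population * 100_000


# Decade with the highest raw number of violent crimes
most_violent = max(decades, key=lambda row: row[2])[0]

# Decade with the highest violent-crime rate relative to population
highest_rate = max(decades, key=violent_per_100k)[0]

print(most_violent, highest_rate)
```

Under both measures the 1990s come out on top among the decades listed, though the comparison excludes the incomplete 2010 row.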