### Spark DataFrames: Exploring Chicago Crimes

This notebook analyzes the crime data from [data.gov](https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD). The dataset contains the crimes reported in the city of Chicago since 2001.

First, we start a SparkSession:

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Chicago_crime_analysis").getOrCreate()

In [None]:
from pyspark.sql.types import (StructType,
                               StructField,
                               DateType,
                               BooleanType,
                               DoubleType,
                               IntegerType,
                               StringType,
                               TimestampType)
from pyspark.sql.functions import col


#### Create the crimes DataFrame

In [None]:
crimes = spark.read.csv("data/Crimes_2001_toPresent.csv", header = True) 

First, we show how many records the DataFrame has:

In [None]:
print("The crimes dataframe has {} records".format(crimes.count()))

In [None]:
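from pyspark.sql.functions import to_date

# Every column is read as a string, so cast the ones used below.
# "Updated On" is assumed to use the same "MM/dd/yyyy hh:mm:ss a" layout as
# "Date"; a plain cast("date") returns null for strings in that layout.
crimes = (crimes
          .withColumn("Arrest", col("Arrest").cast("boolean"))
          .withColumn("Domestic", col("Domestic").cast("boolean"))
          .withColumn("X Coordinate", col("X Coordinate").cast("double"))
          .withColumn("Y Coordinate", col("Y Coordinate").cast("double"))
          .withColumn("Year", col("Year").cast("integer"))
          .withColumn("Updated On", to_date(col("Updated On"), "MM/dd/yyyy hh:mm:ss a"))
          .withColumn("Latitude", col("Latitude").cast("double"))
          .withColumn("Longitude", col("Longitude").cast("double")))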

The following commands show the DataFrame's columns and their data types.

In [None]:
crimes.columns

In [None]:
crimes.dtypes

In [None]:
crimes.printSchema()

In [None]:
crimes.select("*").show(10)

In [None]:
crimes.select("Date").show(10, truncate = False)

#### Changing a column's data type

The "Date" column is stored as a string. We convert it to the timestamp type using a UDF.

**withColumn** creates the new column "Date_time", and **drop** removes the original "Date" column.

In [None]:
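from datetime import datetime
from pyspark.sql.functions import col, udf

# Parse month/day/year strings with a 12-hour clock; missing values stay null
# instead of raising inside the UDF.
myfunc = udf(lambda x: datetime.strptime(x, '%m/%d/%Y %I:%M:%S %p') if x else None,
             TimestampType())
df = crimes.withColumn('Date_time', myfunc(col('Date'))).drop("Date")

df.select(df["Date_time"]).show(5)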
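
To see what the format string passed to `strptime` matches, the parsing step can be tried in plain Python, outside Spark. The sample strings are made up for illustration, and `parse_crime_date` is a hypothetical helper name that does not appear in the notebook.

```python
from datetime import datetime

# Same pattern the UDF uses: month/day/year, 12-hour clock, AM/PM marker.
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def parse_crime_date(value):
    """Parse one 'Date' string; return None for missing values (hypothetical helper)."""
    if value is None:
        return None
    return datetime.strptime(value, DATE_FORMAT)

# Made-up sample values in the same layout as the "Date" column.
print(parse_crime_date("03/18/2015 07:44:00 PM"))  # 2015-03-18 19:44:00
print(parse_crime_date("01/01/2001 12:00:00 AM"))  # 2001-01-01 00:00:00
print(parse_crime_date(None))                      # None
```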

In [None]:
df.dtypes

### Using describe on the numeric columns



In [None]:
crimes.select(["Latitude","Longitude","Year","X Coordinate","Y Coordinate"]).describe().show()

We use **format_number** to change the formatting of some variables.

In [None]:
from pyspark.sql.functions import format_number

In [None]:
result = crimes.select(["Latitude","Longitude","Year","X Coordinate","Y Coordinate"]).describe()
result.select(result['summary'],
              format_number(result['Latitude'].cast('float'),2).alias('Latitude'),
              format_number(result['Longitude'].cast('float'),2).alias('Longitude'),
              result['Year'].cast('int').alias('year'),
              format_number(result['X Coordinate'].cast('float'),2).alias('X Coordinate'),
              format_number(result['Y Coordinate'].cast('float'),2).alias('Y Coordinate')
             ).show()
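
`format_number` returns strings with thousands separators and a fixed number of decimal places. As a rough plain-Python analog (not Spark's implementation), the `,.2f` format specification produces the same layout for typical values. The numbers below are made up.

```python
# Made-up values standing in for entries of the describe() summary.
values = [1164583.4321, 41.84, 6500000.0]

for v in values:
    # ',' adds thousands separators, '.2f' keeps two decimal places.
    formatted = format(v, ",.2f")
    print(formatted)
```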

#### How many primary crime types are there?


In [None]:
crimes.select("Primary Type").distinct().count()

We can display them as a list:

In [None]:
crimes.select("Primary Type").distinct().show(n = 35)

#### Exercise - How many homicides are in the DataFrame?

#### Exercise - How many domestic assaults are there?


In [None]:
columns = ['Primary Type', 'Description', 'Arrest', 'Domestic']

# "Arrest" is already a boolean column, so it can be used directly as a condition.
crimes.where((crimes["Primary Type"] == "HOMICIDE") & crimes["Arrest"])\
      .select(columns).show(10)

We can use **limit** to restrict the number of rows returned from the DataFrame.

In [None]:
crimes.select(columns).limit(11).show(truncate = True)

In [None]:
lat_max = crimes.agg({"Latitude" : "max"}).collect()[0][0]

print("The maximum latitude value is {}".format(lat_max))


#### Exercise - Create a new column "difference_from_max_lat" holding the difference from the "lat_max" variable, and show the result (the first 5 rows)


#### How to rename columns


In [None]:
df = crimes.withColumnRenamed("Latitude", "Lat")
df.columns

In [None]:
columns = ['Primary Type', 'Description', 'Arrest', 'Domestic','Lat']

df.orderBy(df["Lat"].desc()).select(columns).show(10)

### Spark functions for means, maximums, and minimums

We compute the mean latitude.

In [None]:
from pyspark.sql.functions import mean
# alias must be applied to the column expression, not to the resulting DataFrame.
df.select(mean("Lat").alias("Mean Latitude")).show()

We can also compute the mean with the **agg** function.

In [None]:
df.agg({"Lat":"avg"}).show()

In [None]:
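# Import under aliases so Python's built-in max and min are not shadowed.
from pyspark.sql.functions import max as spark_max, min as spark_min

In [None]:
df.select(spark_max("X Coordinate"), spark_min("X Coordinate")).show()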

#### What percentage of crimes are "Domestic"?

In [None]:
df.filter(df["Domestic"]==True).count()/df.count() * 100

#### What is the Pearson correlation coefficient between Lat and Y Coordinate?

In [None]:
from pyspark.sql.functions import corr
df.select(corr("Lat","Y coordinate")).show()
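
For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The plain-Python sketch below illustrates the formula on made-up points. It is not how Spark computes `corr` internally, and `pearson` is a hypothetical helper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up points: y = 2x + 1 is an exact linear function of x, so r is 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
r = pearson(xs, ys)
print(r)  # 1.0
```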

#### Number of crimes per year

In [None]:
df.groupBy("Year").count().show()

In [None]:
df.groupBy("Year").count().collect()

#### We use matplotlib and pandas to plot the number of crimes per year

In [None]:
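# Collect the grouped counts once so both lists come from the same rows, in the same order.
crimes_per_year = df.groupBy("Year").count().collect()
count = [item[1] for item in crimes_per_year]
year = [item[0] for item in crimes_per_year]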

In [None]:
number_of_crimes_per_year = {"count":count, "year" : year}

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
number_of_crimes_per_year = pd.DataFrame(number_of_crimes_per_year)

In [None]:
number_of_crimes_per_year.head()


In [None]:
number_of_crimes_per_year = number_of_crimes_per_year.sort_values(by = "year")

number_of_crimes_per_year.plot(figsize = (20,10), kind = "bar", color = "red",
                               x = "year", y = "count", legend = False)

plt.xlabel("", fontsize = 18)
plt.ylabel("Number of Crimes", fontsize = 18)
plt.title("Number of Crimes Per Year", fontsize = 28)
plt.xticks(size = 18)
plt.yticks(size = 18)
plt.show()

### Where do most crimes occur?

In [None]:
crimes.groupBy("Location Description").count().show()

In [None]:
crime_location  = crimes.groupBy("Location Description").count().collect()
location = [item[0] for item in crime_location]
count = [item[1] for item in crime_location]
crime_location = {"location" : location, "count": count}
crime_location = pd.DataFrame(crime_location)
crime_location = crime_location.sort_values(by = "count", ascending  = False)
crime_location.iloc[:5]


In [None]:
crime_location = crime_location.iloc[:20]

myplot = crime_location.plot(figsize = (20,20), kind = "barh", color = "#b35900", width = 0.8,
                             x = "location", y = "count", legend = False)

myplot.invert_yaxis()

plt.xlabel("Number of crimes", fontsize = 28)
plt.ylabel("Crime Location", fontsize = 28)
plt.title("Number of Crimes By Location", fontsize = 36)
plt.xticks(size = 24)
plt.yticks(size = 24)
plt.show()