## Members

| Name             | e-mail                   | GitHub Username  |
|------------------|--------------------------|------------------|
| Hector Sanchez   | hector.sanchez42@upr.edu | Jisanos          |
| Juan Hernandez   | juan.hernandez41@upr.edu | JuanHdez1        |
| Adam Nunez       | adam.nunez@upr.edu       | grumpyandres     |
| Jasiel Rivera    | jasiel.rivera@upr.edu    | jasielrt95       |
| Angel Romero     | angel.romero5@upr.edu    | AngelRomero5     |
| Sergio Mattei    | sergio.mattei@upr.edu    | matteing         |
| Hallyma Gauthier | hallyma.gauthier@upr.edu | hallyma-gauthier |

# Is there a relation between unemployment rates and violent crimes in the municipalities of Puerto Rico?

Before the group settled on this question, we gathered and cleaned data to determine which datasets existed and whether they contained enough metrics to support a worthwhile analysis.

We found the following datasets:

- [Employment and Unemployment in each municipality of PR](https://indicadores.pr/id/dataset/estadisticas-de-desempleo-de-area-local/resource/3782a13d-96dd-4cab-adb2-5ace776c6f8a)
- [Bankruptcy Filings in PR](https://estadisticas.pr/files/inventario/quiebras/2023-04-17/USBC-BankruptcyFilings-2023-03.pdf)
- [Criminality (Type 1 Offences) in each municipality of PR](https://estadisticas.pr/en/inventario-de-estadisticas/delitos_tipo_i)
- [Electricity Outages in PR by LUMA](https://luma-outages.jpadilla.com/data?sql=select+*+from+customers)
- [Generation, consumption, cost, revenue, and customers of the PR electrical system (LUMA)](https://www.indicadores.pr/dataset/generacion-consumo-costo-ingresos-y-clientes-del-sistema-electrico-de-puerto-rico/resource/8025f821-45c1-4c6a-b2f4-8d641cc03df1)
- [Population Change in PR](https://censo.estadisticas.pr/EstimadosPoblacionales)
- [Earthquake Activity in PR](https://redsismica.uprm.edu/spanish/sismicidad/catalogos/catalogo_general.php)
- [Registered Deaths in PR](https://datos.estadisticas.pr/dataset/defunciones-registradas)

Each member was assigned one of these datasets to clean so that we could then identify possible ideas for the project. The ideas we considered were:

- Correlation between unemployment and bankruptcy
- Correlation between unemployment and type 1 offences
- Correlation between mortality rate and type 1 offences
- Correlation between electricity outages and earthquakes
- Correlation between economic factors, natural disasters, mortality rate, electricity outages, and increase or reduction of population
- Correlation between bankruptcy and COVID-19 cases

Weighing what most members considered important against the data actually available to us, we chose the current topic.

## Cleaning the Data

### Type 1 Offences

In [1]:
import pandas as pd
from glob import glob
import os

In [2]:
# Finding all xls files in the directory of delitos
files = glob("./data/raw/Delitos Tipo 1/*/*.xls")
files.extend(glob("./data/raw/Delitos Tipo 1/*/*.xlsx"))

In [3]:
# When importing each xls, a new column will be created to identify the year-month by its file name
df = pd.DataFrame()
for f in files:
    
    tmp = pd.read_excel(f, sheet_name="MUNICIPIOS")
    # Dropping rows with all nans before assigning year_month value
    tmp.dropna(how='all', inplace=True)
    # Dropping columns with all nans as well
    tmp.dropna(axis=1, how='all', inplace=True)
    
    # In the case that the columns have inconsistent names,
    # they will be renamed appropriately.
    tmp.rename({"Trata Humana":"Trata Hum.",
                "Agresión Grave":"Agr. Grave",
                "Violación":"Viol.",
                "Apropiación Ilegal":"Apr. I",
                "Escalamiento":"Esc.",
                "Hurto Auto":"H. Auto",
                "Asesinato":"Ases.",
                }, axis=1, inplace=True)
    
    year_month = os.path.basename(f).removeprefix("Policia_DelitosTipoI_").removesuffix(".xls").removesuffix(".xlsx")
    year_month = year_month[:4] + '-' + year_month[4:]
    tmp['Date'] = year_month
    
    df = pd.concat([df, tmp])
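
The `Date` label depends entirely on the file-name pattern, so it may help to isolate that step in a small function that fails loudly on an unexpected name. The sketch below assumes names of the form `Policia_DelitosTipoI_YYYYMM.xls` or `.xlsx`, as the loop above implies. The helper name `year_month_from_path` and the example folder are hypothetical.

```python
import os
import re

# Assumed pattern: Policia_DelitosTipoI_YYYYMM.xls or .xlsx
FILENAME_PATTERN = re.compile(r"^Policia_DelitosTipoI_(\d{4})(\d{2})\.xlsx?$")


def year_month_from_path(path):
    """Return 'YYYY-MM' parsed from a Delitos Tipo I file name (hypothetical helper)."""
    match = FILENAME_PATTERN.match(os.path.basename(path))
    if match is None:
        # Surface unexpected names instead of silently producing a bad label
        raise ValueError(f"Unexpected file name: {path}")
    year, month = match.groups()
    return f"{year}-{month}"


# The folder in this path is made up for illustration
example = year_month_from_path("./data/raw/Delitos Tipo 1/2014/Policia_DelitosTipoI_201412.xlsx")
print(example)
```

Raising on a mismatch means a renamed or misfiled spreadsheet would be noticed at load time rather than showing up later as an odd date.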


In [4]:
df.reset_index(inplace=True, drop=True)

In [5]:
# Will drop TOTAL rows from district
df = df[~(df['Distrito'] == 'TOTAL')]

In [6]:
# Will drop randomly inserted row that repeats the header
df = df[~(df['Distrito'] == 'Distrito')]

In [7]:
df.reset_index(inplace=True, drop=True)
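
The two filters above could also be written as one step with `isin`. This is a minimal sketch on a toy frame, not the real spreadsheets; the values in the `TOTAL` row are made up.

```python
import pandas as pd

# Toy frame standing in for the concatenated spreadsheets
toy = pd.DataFrame({
    "Distrito": ["Adjuntas", "TOTAL", "Distrito", "Aguada"],
    "Tipo I": [117, 440, "Tipo I", 323],
})

# Labels that mark subtotal rows or a repeated header row
unwanted = ["TOTAL", "Distrito"]

# Keep only rows whose Distrito is not one of the unwanted labels
cleaned = toy[~toy["Distrito"].isin(unwanted)].reset_index(drop=True)
print(cleaned)
```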

In [8]:
df.columns
# Some spreadsheets use shortened column names for the same values,
# so they must be merged. The rename is done
# before concatenating the dataframes.

Index(['Distrito', 'Tipo I', 'Ases.', 'Viol.', 'Robo', 'Agr. Grave', 'Esc.',
       'Apr. I', 'H. Auto', 'Date', 'Trata Hum.', 'Unnamed: 10', 'Unnamed: 9',
       'Unnamed: 20', 'AREA'],
      dtype='object')

In [9]:
assoc_columns = {"Trata Humana":"Trata Hum.",
                "Agresión Grave":"Agr. Grave",
                "Violación":"Viol.",
                "Apropiación Ilegal":"Apr. I",
                "Escalamiento":"Esc.",
                "Hurto Auto":"H. Auto",
                "Asesinato":"Ases.",
                }

In [10]:
df.isna().sum()

Distrito          21
Tipo I            21
Ases.            181
Viol.            333
Robo              77
Agr. Grave        26
Esc.              28
Apr. I            21
H. Auto          107
Date               0
Trata Hum.      5201
Unnamed: 10    10315
Unnamed: 9     10298
Unnamed: 20    10304
AREA           10239
dtype: int64

In [11]:
# Trata humana has too many missing values to be significant
df.drop('Trata Hum.', axis=1, inplace=True)
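
One way to make this decision less ad hoc is to compute the share of missing values per column and drop anything above a cutoff. The sketch below uses a toy frame, and the 0.5 threshold is an assumption for illustration, not a value the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy frame: one column is mostly empty, like Trata Hum. in the real data
toy = pd.DataFrame({
    "Tipo I": [10, 12, 9, 11],
    "Ases.": [1, np.nan, 0, 2],
    "Trata Hum.": [np.nan, np.nan, np.nan, 0],
})

# Fraction of missing values in each column
missing_share = toy.isna().mean()

threshold = 0.5  # assumed cutoff for illustration
sparse_columns = missing_share[missing_share > threshold].index.tolist()

# Drop the columns that exceed the cutoff
trimmed = toy.drop(columns=sparse_columns)
print(sparse_columns)
```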

In [12]:
df

Unnamed: 0,Distrito,Tipo I,Ases.,Viol.,Robo,Agr. Grave,Esc.,Apr. I,H. Auto,Date,Unnamed: 10,Unnamed: 9,Unnamed: 20,AREA
0,Adjuntas,117,0,0,6,5,32,69,5,2014-12,,,,
1,Aguada,323,4,2,19,13,78,194,13,2014-12,,,,
2,Aguadilla,736,5,2,30,24,280,374,21,2014-12,,,,
3,Aguas Buenas,261,6,0,28,14,102,89,22,2014-12,,,,
4,Aibonito,232,8,0,14,15,62,104,29,2014-12,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10312,Vega Baja,155,5,1,9,14,38,74,14,2020-04,,,,
10313,Vieques,47,3,0,0,11,16,15,2,2020-04,,,,
10314,Villalba,14,0,0,1,5,3,5,0,2020-04,,,,
10315,Yabucoa,32,0,2,0,8,9,12,1,2020-04,,,,


In [13]:
# sorted(list(df['Date'].unique()))

In [14]:
df[~df['Distrito'].isna()]

Unnamed: 0,Distrito,Tipo I,Ases.,Viol.,Robo,Agr. Grave,Esc.,Apr. I,H. Auto,Date,Unnamed: 10,Unnamed: 9,Unnamed: 20,AREA
0,Adjuntas,117,0,0,6,5,32,69,5,2014-12,,,,
1,Aguada,323,4,2,19,13,78,194,13,2014-12,,,,
2,Aguadilla,736,5,2,30,24,280,374,21,2014-12,,,,
3,Aguas Buenas,261,6,0,28,14,102,89,22,2014-12,,,,
4,Aibonito,232,8,0,14,15,62,104,29,2014-12,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10312,Vega Baja,155,5,1,9,14,38,74,14,2020-04,,,,
10313,Vieques,47,3,0,0,11,16,15,2,2020-04,,,,
10314,Villalba,14,0,0,1,5,3,5,0,2020-04,,,,
10315,Yabucoa,32,0,2,0,8,9,12,1,2020-04,,,,


In [15]:
df[~(df['AREA'].isna())]
# Area was oddly added in the Policia_DelitosTipoI_201612 spreadsheet

Unnamed: 0,Distrito,Tipo I,Ases.,Viol.,Robo,Agr. Grave,Esc.,Apr. I,H. Auto,Date,Unnamed: 10,Unnamed: 9,Unnamed: 20,AREA
5636,Adjuntas,78,0.0,0.0,2.0,18,30,28,0.0,2016-12,,,,UTUADO
5637,Aguada,256,2.0,1.0,14.0,41,78,112,8.0,2016-12,,,,AGUADILLA
5638,Aguadilla,598,9.0,2.0,27.0,92,174,281,13.0,2016-12,,,,AGUADILLA
5639,Aguas Buenas,201,8.0,1.0,24.0,14,44,81,29.0,2016-12,,,,CAGUAS
5640,Aibonito,206,6.0,0.0,13.0,16,52,105,14.0,2016-12,,,,AIBONITO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5709,Vega Baja,936,6.0,3.0,76.0,35,262,467,87.0,2016-12,,,,BAYAMON
5710,Vieques,161,2.0,2.0,1.0,27,51,77,1.0,2016-12,,,,FAJARDO
5711,Villalba,138,1.0,1.0,1.0,28,35,67,5.0,2016-12,,,,PONCE
5712,Yabucoa,212,5.0,1.0,13.0,26,66,97,4.0,2016-12,,,,HUMACAO


In [16]:
df.drop('AREA',axis=1,inplace=True)

In [17]:
# df[~(df['Unnamed: 5'].isna())]
# There are some weird cases where the header did
# not extract properly and the column positions shifted.
# This seems to happen when the first row in the MUNICIPIOS
# sheet is empty, so the header is not extracted properly.
# Straightforward fix is to remove the empty row from the
# spreadsheet directly and manually.
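
An alternative to editing the spreadsheets by hand is to read the sheet without a header and promote the first row that starts with `Distrito`. The sketch below works on a toy frame; in practice the input could come from `pd.read_excel(f, sheet_name="MUNICIPIOS", header=None)`. The helper name `promote_header` is hypothetical, and the sketch assumes the header always has `Distrito` in its first column.

```python
import pandas as pd

# Toy sheet read without a header: the first row is empty,
# and the real header sits on the second row
raw = pd.DataFrame([
    [None, None, None],
    ["Distrito", "Tipo I", "Ases."],
    ["Adjuntas", 117, 0],
    ["Aguada", 323, 4],
])


def promote_header(sheet, marker="Distrito"):
    """Use the first row whose first cell equals `marker` as the header (hypothetical helper)."""
    is_header = (sheet.iloc[:, 0] == marker).to_numpy()
    header_positions = sheet.index[is_header]
    if len(header_positions) == 0:
        raise ValueError("No header row found")
    header_row = header_positions[0]
    # Everything below the header row is data
    body = sheet.loc[header_row + 1:].copy()
    body.columns = sheet.loc[header_row].tolist()
    return body.reset_index(drop=True)


fixed = promote_header(raw)
print(fixed)
```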

In [18]:
df[~(df['Unnamed: 10'].isna())]
# Unnamed: 10 seems to contain the data of Tipo 1 of some
# of the spreadsheets I converted from pdf to xls online.
# This happens because some of the columns are merged in the
# spreadsheet so the easiest solution is to fix it directly on
# the sheets by merging those cases.
# Once fixing that, it can be dropped

Unnamed: 0,Distrito,Tipo I,Ases.,Viol.,Robo,Agr. Grave,Esc.,Apr. I,H. Auto,Date,Unnamed: 10,Unnamed: 9,Unnamed: 20
1716,,,,,,,,,,2015-11,,,
6416,,,,,,,,,,2016-03,,,


In [19]:
df[~(df['Unnamed: 9'].isna())]
# Nothing significant here. Can be dropped.

Unnamed: 0,Distrito,Tipo I,Ases.,Viol.,Robo,Agr. Grave,Esc.,Apr. I,H. Auto,Date,Unnamed: 10,Unnamed: 9,Unnamed: 20
2185,,,,,,,,,,2013-12,,,
3200,,,,,,,,,,2012-06,,,
3825,,,,,,,,,,2011-01,,,
3904,,,,,,,,,,2011-02,,,
3983,,,,,,,,,,2011-03,,,
4062,,,,,,,,,,2011-04,,,
4141,,,,,,,,,,2011-05,,,
4766,,,,,,,,,,2010-01,,,
4845,,,,,,,,,,2010-09,,,
4924,,,,,,,,,,2010-02,,,


In [20]:
df.drop('Unnamed: 9', axis=1, inplace=True)

In [21]:
df[~(df['Unnamed: 10'].isna())]

Unnamed: 0,Distrito,Tipo I,Ases.,Viol.,Robo,Agr. Grave,Esc.,Apr. I,H. Auto,Date,Unnamed: 10,Unnamed: 20
1716,,,,,,,,,,2015-11,,
6416,,,,,,,,,,2016-03,,


In [22]:
df.drop('Unnamed: 10', axis=1, inplace=True)

In [23]:
df[~(df['Unnamed: 20'].isna())]
# Sometimes a 3 slips here, this column can be dropped

Unnamed: 0,Distrito,Tipo I,Ases.,Viol.,Robo,Agr. Grave,Esc.,Apr. I,H. Auto,Date,Unnamed: 20
2134,Florida,160,4,0,7,7,69,68,5,2013-12,3.0
3149,Florida,60,0,0,4,0,33,19,4,2012-06,3.0
3853,Florida,34,7,0,2,0,17,6,2,2011-02,3.0
3932,Florida,37,7,0,2,0,19,7,2,2011-03,3.0
4011,Florida,42,7,0,2,0,23,8,2,2011-04,3.0
4090,Florida,55,7,0,2,1,34,8,3,2011-05,3.0
4873,Florida,24,0,0,0,0,15,6,3,2010-02,3.0
4952,Florida,32,0,0,1,0,21,7,3,2010-03,3.0
5110,Florida,47,1,0,4,0,25,11,6,2010-04,3.0
5189,Florida,61,1,0,4,0,34,14,8,2010-05,3.0


In [24]:
df.drop('Unnamed: 20', axis=1,inplace=True)

In [25]:
df[~(df['Distrito'].isna())]
# All rows where Distrito is nan can be dropped.

Unnamed: 0,Distrito,Tipo I,Ases.,Viol.,Robo,Agr. Grave,Esc.,Apr. I,H. Auto,Date
0,Adjuntas,117,0,0,6,5,32,69,5,2014-12
1,Aguada,323,4,2,19,13,78,194,13,2014-12
2,Aguadilla,736,5,2,30,24,280,374,21,2014-12
3,Aguas Buenas,261,6,0,28,14,102,89,22,2014-12
4,Aibonito,232,8,0,14,15,62,104,29,2014-12
...,...,...,...,...,...,...,...,...,...,...
10312,Vega Baja,155,5,1,9,14,38,74,14,2020-04
10313,Vieques,47,3,0,0,11,16,15,2,2020-04
10314,Villalba,14,0,0,1,5,3,5,0,2020-04
10315,Yabucoa,32,0,2,0,8,9,12,1,2020-04


In [26]:
df=df[~(df['Distrito'].isna())]

In [27]:
df.reset_index(inplace=True, drop=True)

In [28]:
df

Unnamed: 0,Distrito,Tipo I,Ases.,Viol.,Robo,Agr. Grave,Esc.,Apr. I,H. Auto,Date
0,Adjuntas,117,0,0,6,5,32,69,5,2014-12
1,Aguada,323,4,2,19,13,78,194,13,2014-12
2,Aguadilla,736,5,2,30,24,280,374,21,2014-12
3,Aguas Buenas,261,6,0,28,14,102,89,22,2014-12
4,Aibonito,232,8,0,14,15,62,104,29,2014-12
...,...,...,...,...,...,...,...,...,...,...
10291,Vega Baja,155,5,1,9,14,38,74,14,2020-04
10292,Vieques,47,3,0,0,11,16,15,2,2020-04
10293,Villalba,14,0,0,1,5,3,5,0,2020-04
10294,Yabucoa,32,0,2,0,8,9,12,1,2020-04


In [29]:
df = df.sort_values(by='Date')

In [30]:
df.reset_index(inplace=True, drop=True)

In [31]:
df.isna().sum()

Distrito        0
Tipo I          0
Ases.         160
Viol.         312
Robo           56
Agr. Grave      5
Esc.            7
Apr. I          0
H. Auto        86
Date            0
dtype: int64

In [32]:
df[df['Ases.'].isna()]

Unnamed: 0,Distrito,Tipo I,Ases.,Viol.,Robo,Agr. Grave,Esc.,Apr. I,H. Auto,Date
706,Culebra,58,,,,2,17,39,,2010-10
710,Guánica,137,,,10,10,45,67,5,2010-10
715,Hatillo,395,,,30,4,143,122,96,2010-10
721,Adjuntas,258,,,15,9,126,100,8,2010-10
726,Añasco,157,,,15,7,83,41,11,2010-10
...,...,...,...,...,...,...,...,...,...,...
8022,Maricao,26,,,,3,6,17,,2018-07
8025,Lares,98,,1,5,21,27,40,4,2018-07
8028,Naranjito,156,,,7,8,32,90,19,2018-07
8030,Patillas,54,,,2,18,8,26,,2018-07


In [33]:
pd.to_datetime(df['Date'])

0       2010-01-01
1       2010-01-01
2       2010-01-01
3       2010-01-01
4       2010-01-01
           ...    
10291   2020-12-01
10292   2020-12-01
10293   2020-12-01
10294   2020-12-01
10295   2020-12-01
Name: Date, Length: 10296, dtype: datetime64[ns]

In [34]:
df.dtypes
# Need to change some of the datatypes to be numeric

Distrito      object
Tipo I        object
Ases.         object
Viol.         object
Robo          object
Agr. Grave    object
Esc.          object
Apr. I        object
H. Auto       object
Date          object
dtype: object

In [35]:
# Convert every count column to a numeric dtype in one pass
numeric_columns = ['Tipo I', 'Ases.', 'Viol.', 'Robo', 'Agr. Grave', 'Esc.', 'Apr. I', 'H. Auto']
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric)

In [36]:
df.dtypes

Distrito       object
Tipo I          int64
Ases.         float64
Viol.         float64
Robo          float64
Agr. Grave    float64
Esc.          float64
Apr. I          int64
H. Auto       float64
Date           object
dtype: object

In [37]:
df['Date'] = pd.to_datetime(df['Date'])

In [38]:
# Making sure we strip all distrito names of any whitespaces
df['Distrito'] = df['Distrito'].str.strip()

In [39]:
print(sorted(df['Distrito'].unique()))
print(len(df['Distrito'].unique()))
# 78 distinct names, one per municipality, so there are no spelling variants

['Adjuntas', 'Aguada', 'Aguadilla', 'Aguas Buenas', 'Aibonito', 'Arecibo', 'Arroyo', 'Añasco', 'Barceloneta', 'Barranquitas', 'Bayamón', 'Cabo Rojo', 'Caguas', 'Camuy', 'Canóvanas', 'Carolina', 'Cataño', 'Cayey', 'Ceiba', 'Ciales', 'Cidra', 'Coamo', 'Comerío', 'Corozal', 'Culebra', 'Dorado', 'Fajardo', 'Florida', 'Guayama', 'Guayanilla', 'Guaynabo', 'Gurabo', 'Guánica', 'Hatillo', 'Hormigueros', 'Humacao', 'Isabela', 'Jayuya', 'Juana Díaz', 'Juncos', 'Lajas', 'Lares', 'Las Marías', 'Las Piedras', 'Loiza', 'Luquillo', 'Manatí', 'Maricao', 'Maunabo', 'Mayagüez', 'Moca', 'Morovis', 'Naguabo', 'Naranjito', 'Orocovis', 'Patillas', 'Peñuelas', 'Ponce', 'Quebradillas', 'Rincón', 'Rio Grande', 'Sabana Grande', 'Salinas', 'San Germán', 'San Juan', 'San Lorenzo', 'San Sebastián', 'Santa Isabel', 'Toa Alta', 'Toa Baja', 'Trujillo Alto', 'Utuado', 'Vega Alta', 'Vega Baja', 'Vieques', 'Villalba', 'Yabucoa', 'Yauco']
78


In [40]:
def fill_missing(x):
    
    # set nans from first row equal to 0
    # first = x.index[0]
    # x.loc[first] = x.loc[first].fillna(0)
    
    columns = ['Tipo I','Ases.','Viol.','Robo','Agr. Grave','Esc.','Apr. I','H. Auto']
    
    
    x[columns] = x[columns].interpolate()  # first linear interpolation
    
    x[columns] = x[columns].ffill()  # then forward fill
    x[columns] = x[columns].bfill()  # then backward fill
    
    
    return x
    

df_interpolated = df.groupby(['Distrito', df.Date.dt.year]).apply(fill_missing)
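
To see what the three fill steps do, here is a minimal sketch on a single toy series with gaps at the start, middle, and end. The values are made up; only the order of operations mirrors `fill_missing`.

```python
import numpy as np
import pandas as pd

# Toy series with a leading, an interior, and a trailing gap
series = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

step1 = series.interpolate()  # linear interpolation fills the interior gap
step2 = step1.ffill()         # forward fill covers any gap left after a known value
filled = step2.bfill()        # backward fill covers the leading gap

print(filled.tolist())
```

The leading gap can only be filled by the backward fill, since no earlier value exists to interpolate from.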

In [41]:
df_interpolated.isna().sum()

Distrito      0
Tipo I        0
Ases.         0
Viol.         0
Robo          0
Agr. Grave    0
Esc.          0
Apr. I        0
H. Auto       0
Date          0
dtype: int64

In [42]:
df_interpolated

Unnamed: 0,Distrito,Tipo I,Ases.,Viol.,Robo,Agr. Grave,Esc.,Apr. I,H. Auto,Date
0,Adjuntas,34,0.0,0.0,0.0,1.0,16.0,16,1.0,2010-01-01
1,Ponce,262,11.0,0.0,28.0,27.0,50.0,125,21.0,2010-01-01
2,Peñuelas,12,0.0,1.0,2.0,1.0,2.0,5,1.0,2010-01-01
3,Patillas,17,0.0,0.0,1.0,3.0,8.0,5,0.0,2010-01-01
4,Orocovis,34,0.0,0.0,0.0,2.0,25.0,6,1.0,2010-01-01
...,...,...,...,...,...,...,...,...,...,...
10291,Orocovis,75,1.0,1.0,1.0,11.0,23.0,35,3.0,2020-12-01
10292,Patillas,66,2.0,2.0,2.0,21.0,17.0,21,1.0,2020-12-01
10293,Ponce,754,24.0,3.0,32.0,186.0,125.0,365,18.0,2020-12-01
10294,Mayagüez,241,15.0,4.0,8.0,58.0,47.0,93,16.0,2020-12-01


In [43]:
df_interpolated.dtypes

Distrito              object
Tipo I                 int64
Ases.                float64
Viol.                float64
Robo                 float64
Agr. Grave           float64
Esc.                 float64
Apr. I                 int64
H. Auto              float64
Date          datetime64[ns]
dtype: object

In [44]:
# Cast the interpolated float columns back to integers in one pass
int_columns = ['Ases.', 'Viol.', 'Robo', 'Agr. Grave', 'Esc.', 'H. Auto']
df_interpolated[int_columns] = df_interpolated[int_columns].astype(int)

In [45]:
df_interpolated

Unnamed: 0,Distrito,Tipo I,Ases.,Viol.,Robo,Agr. Grave,Esc.,Apr. I,H. Auto,Date
0,Adjuntas,34,0,0,0,1,16,16,1,2010-01-01
1,Ponce,262,11,0,28,27,50,125,21,2010-01-01
2,Peñuelas,12,0,1,2,1,2,5,1,2010-01-01
3,Patillas,17,0,0,1,3,8,5,0,2010-01-01
4,Orocovis,34,0,0,0,2,25,6,1,2010-01-01
...,...,...,...,...,...,...,...,...,...,...
10291,Orocovis,75,1,1,1,11,23,35,3,2020-12-01
10292,Patillas,66,2,2,2,21,17,21,1,2020-12-01
10293,Ponce,754,24,3,32,186,125,365,18,2020-12-01
10294,Mayagüez,241,15,4,8,58,47,93,16,2020-12-01


In [46]:
# df[df['Distrito'] == 'Vega Baja']['Tipo I'].diff()

In [47]:
# Exporting both interpolated and non interpolated versions of the dataframe
df.to_csv('./data/clean/DelitosTipo1/DelitosTipo1-2010-2020.csv')
df_interpolated.to_csv('./data/clean/DelitosTipo1/DelitosTipo1-2010-2020(interpolado).csv')

In [48]:
df_interpolated['Date'] = pd.to_datetime(df_interpolated['Date'])

In [49]:
# Turning data from cumulative into monthly deltas.
def year_to_month(x):
    # Turning into monthly deltas
    cols =['Tipo I','Ases.', 'Viol.', 'Robo', 'Agr. Grave','Esc.', 'Apr. I', 'H. Auto']
    
    x[cols] = x[cols].diff()
    
    return x

df_interpolated_deltas = df_interpolated.groupby(['Distrito', df_interpolated.Date.dt.year]).apply(year_to_month)
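
Note that `diff()` leaves the first month of each year as NaN, and the notebook then fills those gaps with `fill_missing`. If the counts are cumulative within each calendar year, the January value is already the January count, so it could be kept as is. The sketch below shows that alternative on toy numbers; it assumes rows sorted by date within each municipality, and the column name `Tipo I mensual` is hypothetical.

```python
import pandas as pd

# Toy cumulative counts for one municipality across two years
toy = pd.DataFrame({
    "Distrito": ["Adjuntas"] * 5,
    "Date": pd.to_datetime([
        "2010-01-01", "2010-02-01", "2010-03-01",
        "2011-01-01", "2011-02-01",
    ]),
    "Tipo I": [5, 8, 12, 3, 7],
})
toy = toy.sort_values(["Distrito", "Date"]).reset_index(drop=True)

# Difference within each (municipality, year) group
year = toy["Date"].dt.year.rename("Year")
deltas = toy.groupby(["Distrito", year])["Tipo I"].diff()

# The first month of each year has no predecessor, so keep its cumulative value
toy["Tipo I mensual"] = deltas.fillna(toy["Tipo I"]).astype(int)
print(toy)
```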

In [50]:
df_interpolated_deltas = df_interpolated_deltas.groupby(['Distrito']).apply(fill_missing)

In [51]:
df_interpolated_deltas

Unnamed: 0,Distrito,Tipo I,Ases.,Viol.,Robo,Agr. Grave,Esc.,Apr. I,H. Auto,Date
0,Adjuntas,34.0,0.0,0.0,4.0,1.0,12.0,16.0,1.0,2010-01-01
1,Ponce,243.0,4.0,0.0,26.0,14.0,52.0,123.0,24.0,2010-01-01
2,Peñuelas,12.0,0.0,0.0,2.0,0.0,6.0,2.0,2.0,2010-01-01
3,Patillas,13.0,0.0,0.0,0.0,4.0,6.0,3.0,0.0,2010-01-01
4,Orocovis,22.0,0.0,0.0,0.0,0.0,10.0,12.0,0.0,2010-01-01
...,...,...,...,...,...,...,...,...,...,...
10291,Orocovis,11.0,0.0,0.0,0.0,2.0,5.0,3.0,1.0,2020-12-01
10292,Patillas,8.0,0.0,0.0,0.0,6.0,1.0,1.0,0.0,2020-12-01
10293,Ponce,71.0,0.0,1.0,6.0,10.0,13.0,40.0,1.0,2020-12-01
10294,Mayagüez,34.0,2.0,0.0,1.0,6.0,6.0,18.0,1.0,2020-12-01


In [52]:
df_interpolated_deltas.isna().sum()

Distrito      0
Tipo I        0
Ases.         0
Viol.         0
Robo          0
Agr. Grave    0
Esc.          0
Apr. I        0
H. Auto       0
Date          0
dtype: int64

In [53]:
df_interpolated_deltas.to_csv('./data/clean/DelitosTipo1/DelitosTipo1-2010-2020_deltas_mensuales(interpolado).csv')

### Unemployment Rate

In [54]:
import numpy as np
import pandas as pd

In [55]:
datasetPersDespl = pd.read_csv("./data/raw/Desempleos/personas_desempleadas_municipal.csv",encoding='iso8859_3')
datasetPersDespl = datasetPersDespl.drop(index = [0])
datasetPersDespl.drop(datasetPersDespl.tail(295).index,inplace = True)
datasetPersDespl

Unnamed: 0,Date,Puerto Rico SA,Puerto Rico NSA,Adjuntas NSA,Aguada NSA,Aguadilla NSA,Aguas Buenas NSA,Aibonito NSA,Añasco NSA,Arecibo NSA,...,"Coco, Salinas, PR MicroSA NSA","Jayuya, PR MicroSA NSA","Santa Isabel, PR MicroSA NSA","Aguadilla-Isabela, PR MSA NSA","Arecibo, PR MSA NSA","Guayama, PR MSA NSA","Mayaguez, PR MSA NSA","Ponce, PR MSA NSA","San Germán, PR MSA NSA","San Juan-Carolina-Caguas, PR MSA NSA"
1,2020-08,86123,88480,417.0,978.0,1261.0,594.0,517.0,642.0,1974.0,...,669.0,304.0,570.0,7372.0,4216.0,1523.0,2075.0,7589.0,2888.0,59237.0
2,2020-07,87804,75757,348.0,854.0,1115.0,516.0,451.0,547.0,1767.0,...,560.0,253.0,474.0,6410.0,3728.0,1320.0,1812.0,6437.0,2418.0,50612.0
3,2020-06,92311,90194,395.0,1058.0,1307.0,629.0,527.0,671.0,2167.0,...,648.0,282.0,515.0,7708.0,4555.0,1582.0,2225.0,7538.0,2878.0,60277.0
4,2020-05,93122,97038,435.0,1220.0,1397.0,673.0,545.0,741.0,2352.0,...,694.0,267.0,530.0,8660.0,5078.0,1677.0,2610.0,8563.0,3455.0,63425.0
5,2020-02,92713,80739,763.0,1096.0,1232.0,550.0,583.0,689.0,2054.0,...,876.0,515.0,744.0,9190.0,5368.0,2375.0,2316.0,9982.0,3431.0,43319.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235,2000-12,127800,114426,567.0,1687.0,2401.0,730.0,802.0,1000.0,3058.0,...,998.0,495.0,820.0,12488.0,6030.0,2609.0,3816.0,12913.0,4469.0,67176.0
236,2000-11,124955,114038,516.0,1686.0,2348.0,734.0,773.0,980.0,3157.0,...,1036.0,504.0,850.0,12382.0,6165.0,2664.0,3864.0,12974.0,4548.0,66478.0
237,2000-10,123946,127720,560.0,1853.0,2616.0,818.0,850.0,1032.0,3493.0,...,1119.0,589.0,932.0,13738.0,6817.0,2975.0,4252.0,14430.0,5014.0,75071.0
238,2000-09,124514,123815,550.0,1842.0,2562.0,780.0,819.0,1145.0,3423.0,...,1116.0,563.0,911.0,13609.0,6581.0,2991.0,4314.0,13971.0,5083.0,71936.0


In [56]:
datasetTasaDespl = pd.read_csv("./data/raw/Desempleos/tasa_desempleo_municipal.csv",encoding='iso8859_3')
datasetTasaDespl = datasetTasaDespl.drop(index=[0])
datasetTasaDespl.drop(datasetTasaDespl.tail(293).index,inplace = True)

datasetTasaDespl

Unnamed: 0,Date,Puerto Rico SA,Puerto Rico NSA,Adjuntas NSA,Aguada NSA,Aguadilla NSA,Aguas Buenas NSA,Aibonito NSA,Añasco NSA,Arecibo NSA,...,"Coco, Salinas, PR MicroSA NSA","Jayuya, PR MicroSA NSA","Santa Isabel, PR MicroSA NSA","Aguadilla-Isabela, PR MSA NSA","Arecibo, PR MSA NSA","Guayama, PR MSA NSA","Mayaguez, PR MSA NSA","Ponce, PR MSA NSA","San Germán, PR MSA NSA","San Juan-Carolina-Caguas, PR MSA NSA"
1,8/1/2020,8.3,8.4,10.8,8.7,9.1,9.5,8.5,7.6,8.3,...,10.1,7.9,7.7,9.1,8.1,8.2,8.1,8.6,8.8,8.3
2,7/1/2020,8.5,7.3,9.4,7.6,8.1,8.3,7.5,6.6,7.5,...,8.6,6.8,6.8,8.0,7.2,7.2,7.1,7.4,7.5,7.1
3,6/1/2020,8.9,8.5,10.4,9.2,9.3,9.9,8.7,7.9,9.0,...,9.9,7.4,6.5,9.4,8.7,8.4,8.6,8.5,8.7,8.3
4,5/1/2020,9.0,9.6,11.7,11.1,10.5,11.1,9.4,9.1,10.4,...,10.9,7.4,6.9,11.0,10.2,9.3,10.4,10.0,10.9,9.2
5,4/1/2020,8.8,7.8,18.4,9.7,9.0,8.8,9.6,8.2,9.0,...,13.3,12.7,8.4,11.2,10.5,12.6,9.1,11.2,10.4,6.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
237,12/1/2000,9.7,10.1,11.4,13.5,13.3,9.6,11.3,11.1,10.9,...,13.9,13.0,12.7,12.9,10.9,13.2,11.4,12.6,11.2,9.0
238,11/1/2000,9.7,9.7,11.1,13.3,12.9,9.1,10.9,12.1,10.6,...,13.7,12.6,12.8,12.7,10.4,13.2,11.5,12.3,11.3,8.6
239,10/1/2000,9.8,11.2,13.2,15.1,15.0,10.8,12.9,13.6,11.8,...,15.5,14.6,14.6,14.7,11.9,14.8,13.0,14.0,13.2,9.9
240,9/1/2000,9.9,10.0,12.3,13.1,13.7,10.1,11.8,11.8,10.5,...,14.0,13.1,12.7,13.3,10.6,13.3,11.5,12.5,11.6,8.8


In [57]:
datasetPersEmpleadas = pd.read_csv("./data/raw/Desempleos/personas_empleadas_municipal.csv",encoding='iso8859_3')
datasetPersEmpleadas = datasetPersEmpleadas.drop(index=[0])
datasetPersEmpleadas.drop(datasetPersEmpleadas.tail(293).index,inplace = True)

datasetPersEmpleadas

Unnamed: 0,Date,Puerto Rico SA,Puerto Rico NSA,Adjuntas NSA,Aguada NSA,Aguadilla NSA,Aguas Buenas NSA,Aibonito NSA,Añasco NSA,Arecibo NSA,...,"Coco, Salinas, PR MicroSA NSA","Jayuya, PR MicroSA NSA","Santa Isabel, PR MicroSA NSA","Aguadilla-Isabela, PR MSA NSA","Arecibo, PR MSA NSA","Guayama, PR MSA NSA","Mayaguez, PR MSA NSA","Ponce, PR MSA NSA","San Germán, PR MSA NSA","San Juan-Carolina-Caguas, PR MSA NSA"
1,8/1/2020,951707,960458,3461.0,10303.0,12566.0,5641.0,5594.0,7762.0,21873.0,...,5958.0,3544.0,6834.0,73995.0,48061.0,16939.0,23397.0,80300.0,29805.0,652840.0
2,7/1/2020,946949,966381,3361.0,10347.0,12581.0,5687.0,5539.0,7792.0,21828.0,...,5943.0,3478.0,6504.0,73834.0,47972.0,17036.0,23552.0,80637.0,29710.0,659080.0
3,6/1/2020,942830,973196,3393.0,10442.0,12697.0,5717.0,5511.0,7870.0,21812.0,...,5923.0,3511.0,7348.0,74320.0,47974.0,17167.0,23639.0,80994.0,30209.0,663268.0
4,5/1/2020,939024,918881,3281.0,9799.0,11930.0,5403.0,5230.0,7383.0,20286.0,...,5681.0,3364.0,7156.0,69906.0,44645.0,16333.0,22371.0,76731.0,28263.0,626287.0
5,4/1/2020,956078,957316,3375.0,10200.0,12474.0,5680.0,5519.0,7693.0,20820.0,...,5689.0,3531.0,8088.0,73009.0,45735.0,16492.0,23192.0,78899.0,29413.0,654593.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
237,12/1/2000,1155181,1138016,4331.0,11849.0,17109.0,7741.0,6641.0,8262.0,28476.0,...,6936.0,3936.0,6418.0,92756.0,55983.0,19533.0,32994.0,100313.0,39715.0,758365.0
238,11/1/2000,1159585,1146273,4423.0,11978.0,17294.0,7820.0,6710.0,8352.0,28766.0,...,7051.0,3895.0,6231.0,93530.0,56554.0,19685.0,33146.0,99412.0,40035.0,765566.0
239,10/1/2000,1163225,1131883,4349.0,11720.0,16922.0,7727.0,6629.0,8172.0,28421.0,...,6994.0,3779.0,6190.0,91591.0,55875.0,19559.0,32731.0,98493.0,39564.0,756275.0
240,9/1/2000,1166533,1170012,4511.0,12173.0,17577.0,7979.0,6845.0,8488.0,29349.0,...,7237.0,3878.0,6479.0,95041.0,57699.0,20076.0,33983.0,101444.0,40909.0,780914.0


In [58]:
datasetGrupoTrabajador = pd.read_csv("./data/raw/Desempleos/personas_en_grupo_trabajador_municipal.csv",encoding='iso8859_3')
datasetGrupoTrabajador = datasetGrupoTrabajador.drop(index=[0])
datasetGrupoTrabajador.drop(datasetGrupoTrabajador.tail(295).index,inplace = True)

datasetGrupoTrabajador

Unnamed: 0,Date,Puerto Rico SA,Puerto Rico NSA,Adjuntas NSA,Aguada NSA,Aguadilla NSA,Aguas Buenas NSA,Aibonito NSA,Añasco NSA,Arecibo NSA,...,"Coco, Salinas, PR MicroSA NSA","Jayuya, PR MicroSA NSA","Santa Isabel, PR MicroSA NSA","Aguadilla-Isabela, PR MSA NSA","Arecibo, PR MSA NSA","Guayama, PR MSA NSA","Mayaguez, PR MSA NSA","Ponce, PR MSA NSA","San Germán, PR MSA NSA","San Juan-Carolina-Caguas, PR MSA NSA"
1,2020-08,1037830,1048938,3878.0,11281.0,13827.0,6235.0,6111.0,8404.0,23847.0,...,6627.0,3848.0,7404.0,81367.0,52277.0,18462.0,25472.0,87889.0,32693.0,712077.0
2,2020-07,1034753,1042138,3709.0,11201.0,13696.0,6203.0,5990.0,8339.0,23595.0,...,6503.0,3731.0,6978.0,80244.0,51700.0,18356.0,25364.0,87074.0,32128.0,709692.0
3,2020-06,1035141,1063390,3788.0,11500.0,14004.0,6346.0,6038.0,8541.0,23979.0,...,6571.0,3793.0,7863.0,82028.0,52529.0,18749.0,25864.0,88532.0,33087.0,723545.0
4,2020-05,1032146,1015919,3716.0,11019.0,13327.0,6076.0,5775.0,8124.0,22638.0,...,6375.0,3631.0,7686.0,78566.0,49723.0,18010.0,24981.0,85294.0,31718.0,689712.0
5,2020-02,1048791,1038055,4138.0,11296.0,13706.0,6230.0,6102.0,8382.0,22874.0,...,6565.0,4046.0,8832.0,82199.0,51103.0,18867.0,25508.0,88881.0,32844.0,697912.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235,2000-12,1271642,1280556,4918.0,13807.0,19900.0,8672.0,7616.0,9450.0,32275.0,...,8069.0,4428.0,7746.0,107241.0,63468.0,22594.0,37583.0,115311.0,45111.0,845088.0
236,2000-11,1274914,1274382,5003.0,13842.0,19901.0,8614.0,7533.0,9456.0,32142.0,...,8046.0,4579.0,7596.0,107534.0,63148.0,22509.0,37644.0,115471.0,45312.0,838129.0
237,2000-10,1279127,1265736,4891.0,13702.0,19725.0,8559.0,7491.0,9294.0,31969.0,...,8055.0,4525.0,7350.0,106494.0,62800.0,22508.0,37246.0,114743.0,44729.0,833436.0
238,2000-09,1284099,1270088,4973.0,13820.0,19856.0,8600.0,7529.0,9497.0,32189.0,...,8167.0,4458.0,7142.0,107139.0,63135.0,22676.0,37460.0,113383.0,45118.0,837502.0


First, we melt each dataframe so that the municipality and area columns become rows.

In [59]:
df_tasa_melt = pd.melt(datasetTasaDespl,
        id_vars=['Date'],
        var_name='Municipio o Area',
        value_name='Tasa de Desempleo'
       )

In [60]:
df_desempleados_melt = pd.melt(datasetPersDespl,
        id_vars=['Date'],
        var_name='Municipio o Area',
        value_name='Num. Personas Desempleadas'
       )

In [61]:
df_empleados_melt = pd.melt(datasetPersEmpleadas,
        id_vars=['Date'],
        var_name='Municipio o Area',
        value_name='Num. Personas Empleadas'
       )

In [62]:
df_grupo_trabajador_melt = pd.melt(datasetGrupoTrabajador,
        id_vars=['Date'],
        var_name='Municipio o Area',
        value_name='Num. Personas Grupo Trabajador'
       )
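
As a small illustration of what `pd.melt` does here, the sketch below reshapes a two-municipality slice of the unemployment-rate table. The four values are taken from the output shown earlier; the frame itself is a toy.

```python
import pandas as pd

# Wide layout: one column per municipality
wide = pd.DataFrame({
    "Date": ["2020-08", "2020-07"],
    "Adjuntas NSA": [10.8, 9.4],
    "Aguada NSA": [8.7, 7.6],
})

# Long layout: one row per (Date, municipality) pair
long = pd.melt(
    wide,
    id_vars=["Date"],
    var_name="Municipio o Area",
    value_name="Tasa de Desempleo",
)
print(long)
```

Each wide column becomes a block of rows, so two dates and two municipalities give four rows.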

Next, we convert the Date column of each melted dataframe to datetime.

In [63]:
df_tasa_melt['Date'] = pd.to_datetime(df_tasa_melt.Date)
df_desempleados_melt['Date'] = pd.to_datetime(df_desempleados_melt.Date)
df_empleados_melt['Date'] = pd.to_datetime(df_empleados_melt.Date)
df_grupo_trabajador_melt['Date'] = pd.to_datetime(df_grupo_trabajador_melt.Date)
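
The raw files use two date styles (`2020-08` in some, `8/1/2020` in others). Letting pandas infer the format works here, but stating it explicitly removes any ambiguity between month-first and day-first. The sketch below assumes the slash style is month/day/year, which matches the outputs shown above.

```python
import pandas as pd

# The two styles seen in the raw unemployment files
iso_style = pd.Series(["2020-08", "2000-09"])
us_style = pd.Series(["8/1/2020", "9/1/2000"])

# Explicit formats: year-month, and month/day/year (assumed)
iso_dates = pd.to_datetime(iso_style, format="%Y-%m")
us_dates = pd.to_datetime(us_style, format="%m/%d/%Y")

# Both styles should land on the first day of the same month
print(iso_dates.equals(us_dates))
```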

In [64]:
df_tasa_melt

Unnamed: 0,Date,Municipio o Area,Tasa de Desempleo
0,2020-08-01,Puerto Rico SA,8.3
1,2020-07-01,Puerto Rico SA,8.5
2,2020-06-01,Puerto Rico SA,8.9
3,2020-05-01,Puerto Rico SA,9.0
4,2020-04-01,Puerto Rico SA,8.8
...,...,...,...
22890,2000-12-01,"San Juan-Carolina-Caguas, PR MSA NSA",9.0
22891,2000-11-01,"San Juan-Carolina-Caguas, PR MSA NSA",8.6
22892,2000-10-01,"San Juan-Carolina-Caguas, PR MSA NSA",9.9
22893,2000-09-01,"San Juan-Carolina-Caguas, PR MSA NSA",8.8


In [65]:
df_desempleados_melt

Unnamed: 0,Date,Municipio o Area,Num. Personas Desempleadas
0,2020-08-01,Puerto Rico SA,86123.0
1,2020-07-01,Puerto Rico SA,87804.0
2,2020-06-01,Puerto Rico SA,92311.0
3,2020-05-01,Puerto Rico SA,93122.0
4,2020-02-01,Puerto Rico SA,92713.0
...,...,...,...
22700,2000-12-01,"San Juan-Carolina-Caguas, PR MSA NSA",67176.0
22701,2000-11-01,"San Juan-Carolina-Caguas, PR MSA NSA",66478.0
22702,2000-10-01,"San Juan-Carolina-Caguas, PR MSA NSA",75071.0
22703,2000-09-01,"San Juan-Carolina-Caguas, PR MSA NSA",71936.0


In [66]:
df_empleados_melt

Unnamed: 0,Date,Municipio o Area,Num. Personas Empleadas
0,2020-08-01,Puerto Rico SA,951707.0
1,2020-07-01,Puerto Rico SA,946949.0
2,2020-06-01,Puerto Rico SA,942830.0
3,2020-05-01,Puerto Rico SA,939024.0
4,2020-04-01,Puerto Rico SA,956078.0
...,...,...,...
22890,2000-12-01,"San Juan-Carolina-Caguas, PR MSA NSA",758365.0
22891,2000-11-01,"San Juan-Carolina-Caguas, PR MSA NSA",765566.0
22892,2000-10-01,"San Juan-Carolina-Caguas, PR MSA NSA",756275.0
22893,2000-09-01,"San Juan-Carolina-Caguas, PR MSA NSA",780914.0


In [67]:
df_grupo_trabajador_melt

Unnamed: 0,Date,Municipio o Area,Num. Personas Grupo Trabajador
0,2020-08-01,Puerto Rico SA,1037830.0
1,2020-07-01,Puerto Rico SA,1034753.0
2,2020-06-01,Puerto Rico SA,1035141.0
3,2020-05-01,Puerto Rico SA,1032146.0
4,2020-02-01,Puerto Rico SA,1048791.0
...,...,...,...
22700,2000-12-01,"San Juan-Carolina-Caguas, PR MSA NSA",845088.0
22701,2000-11-01,"San Juan-Carolina-Caguas, PR MSA NSA",838129.0
22702,2000-10-01,"San Juan-Carolina-Caguas, PR MSA NSA",833436.0
22703,2000-09-01,"San Juan-Carolina-Caguas, PR MSA NSA",837502.0


Finally, we merge all of these attributes into a single dataframe.

In [68]:
df_final = pd.merge(df_tasa_melt, df_desempleados_melt)
df_final = pd.merge(df_final, df_empleados_melt)
df_final = pd.merge(df_final, df_grupo_trabajador_melt)
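
`pd.merge` defaults to an inner join on the shared columns, so any month present in one file but not another is silently dropped. The outputs above suggest this happens: the unemployed-persons table jumps from 2020-05 to 2020-02, while the rate table includes 2020-04. The sketch below uses an outer join with `indicator=True` to list such rows; it is built on a small slice of Adjuntas values taken from the outputs above.

```python
import pandas as pd

keys = ["Date", "Municipio o Area"]

# Rate table slice: has 2020-04
rates = pd.DataFrame({
    "Date": pd.to_datetime(["2020-05-01", "2020-04-01"]),
    "Municipio o Area": ["Adjuntas NSA"] * 2,
    "Tasa de Desempleo": [11.7, 18.4],
})

# Unemployed-persons slice: skips 2020-04, has 2020-02
counts = pd.DataFrame({
    "Date": pd.to_datetime(["2020-05-01", "2020-02-01"]),
    "Municipio o Area": ["Adjuntas NSA"] * 2,
    "Num. Personas Desempleadas": [435.0, 763.0],
})

# Default-style inner join keeps only the month both tables share
inner = pd.merge(rates, counts, on=keys, how="inner")

# Outer join with an indicator column exposes what the inner join drops
audit = pd.merge(rates, counts, on=keys, how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["Date", "_merge"]])
```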

In [69]:
df_final.isna().sum()

Date                              0
Municipio o Area                  0
Tasa de Desempleo                 0
Num. Personas Desempleadas        0
Num. Personas Empleadas           0
Num. Personas Grupo Trabajador    0
dtype: int64

In [70]:
df_final.dtypes

Date                              datetime64[ns]
Municipio o Area                          object
Tasa de Desempleo                        float64
Num. Personas Desempleadas               float64
Num. Personas Empleadas                  float64
Num. Personas Grupo Trabajador           float64
dtype: object

### Stripping whitespace from names and removing the NSA abbreviation (NSA stands for Not Seasonally Adjusted)

For simplicity, our analysis assumes that Not Seasonally Adjusted values are the same as regular values. Keep this in mind, since about 13 of the cities aren't NSA and could give differing results.

In [71]:
df_final['Municipio o Area'] = df_final[
    'Municipio o Area'].str.replace(" NSA", "")

In [72]:
df_final['Municipio o Area'] = df_final[
    'Municipio o Area'].str.strip()

In [73]:
# Changing Mayaguez to Mayagüez
df_final.loc[df_final['Municipio o Area'] == 'Mayaguez',
             'Municipio o Area'] = 'Mayagüez'

In [74]:
# Changing Juana Diaz to Juana Díaz
df_final.loc[df_final['Municipio o Area'] == 'Juana Diaz',
             'Municipio o Area'] = 'Juana Díaz'
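
The name fixes in the cells above could also be expressed as one suffix removal plus a spelling map, which makes it easy to add more corrections later. The sketch below is an assumption-laden variant, not the notebook's method: it removes only a trailing ` NSA` and replaces only exact whole-name matches. The map `spelling_fixes` is hypothetical.

```python
import pandas as pd

# Toy names covering the cases handled in the cells above
names = pd.Series([
    "Mayaguez NSA",
    " Juana Diaz NSA",
    "Puerto Rico SA",
    "Mayaguez, PR MSA NSA",
])

# Trim whitespace, then remove a trailing " NSA" only
cleaned = names.str.strip().str.replace(r"\s+NSA$", "", regex=True)

# Exact-match spelling corrections (hypothetical map)
spelling_fixes = {"Mayaguez": "Mayagüez", "Juana Diaz": "Juana Díaz"}
cleaned = cleaned.replace(spelling_fixes)
print(cleaned.tolist())
```

Because the map matches whole values only, area labels such as `Mayaguez, PR MSA` are left unchanged, which is also how the `.loc` assignments above behave.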

In [75]:
df_final['Municipio o Area'].unique()

array(['Puerto Rico SA', 'Puerto Rico', 'Adjuntas', 'Aguada', 'Aguadilla',
       'Aguas Buenas', 'Aibonito', 'Añasco', 'Arecibo', 'Arroyo',
       'Barceloneta', 'Barranquitas', 'Bayamón', 'Cabo Rojo', 'Caguas',
       'Camuy', 'Canóvanas', 'Carolina', 'Cataño', 'Cayey', 'Ceiba',
       'Ciales', 'Cidra', 'Coamo', 'Comerío', 'Corozal', 'Culebra',
       'Dorado', 'Fajardo', 'Florida', 'Guánica', 'Guayama', 'Guayanilla',
       'Guaynabo', 'Gurabo', 'Hatillo', 'Hormigueros', 'Humacao',
       'Isabela', 'Jayuya', 'Juana Díaz', 'Juncos', 'Lajas', 'Lares',
       'Las Marías', 'Las Piedras', 'Loiza', 'Luquillo', 'Manatí',
       'Maricao', 'Maunabo', 'Mayagüez', 'Moca', 'Morovis', 'Naguabo',
       'Naranjito', 'Orocovis', 'Patillas', 'Peñuelas', 'Ponce',
       'Quebradillas', 'Rincón', 'Rio Grande', 'Sabana Grande', 'Salinas',
       'San Germán', 'San Juan', 'San Lorenzo', 'San Sebastián',
       'Santa Isabel', 'Toa Alta', 'Toa Baja', 'Trujillo Alto', 'Utuado',
       'Vega Alta', 'V

In [76]:
df_final['Municipio o Area'].unique().size

95

In [77]:
df_final.to_csv(
    "./data/clean/Tasa de Desempleos/tasa_de_desempleos_y_mas_limpio.csv",
    index=False
)