# CS105 Final Project 
- completed by Naveen Joby, Apar Mistry, and Arhum Shahid

This project analyzes marijuana arrests in the City of Los Angeles from 2010 onward. We plan to use Pearson's correlation and k-nearest neighbors (KNN) to look for relationships between age, sex, race (descent code), and area (one of the 21 Community Police Stations). We would also like to predict the charge from sex, age, and area, and to see whether particular areas and age groups are associated with specific marijuana-related charges.
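As a rough illustration of the Pearson step, the sketch below computes a correlation matrix on a small made-up table. The `toy` frame and the `IsMale` column are assumptions for illustration only and are not part of the arrest data.

```python
import pandas as pd

# Hypothetical toy data; the real analysis would use the arrest DataFrame instead.
toy = pd.DataFrame({
    "Age":     [19, 23, 31, 45, 52, 27, 38, 61],
    "AreaID":  [1, 3, 3, 12, 18, 1, 12, 21],
    "SexCode": ["M", "M", "F", "M", "F", "M", "M", "F"],
})

# Pearson's r needs numeric inputs, so encode sex as 0/1 first.
toy["IsMale"] = (toy["SexCode"] == "M").astype(int)

# Pairwise Pearson correlation between the numeric columns.
pearson = toy[["Age", "AreaID", "IsMale"]].corr(method="pearson")
print(pearson.round(2))
```

Note that `AreaID` is a label rather than a quantity, so a Pearson coefficient involving it should be read with caution.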

In [None]:
#$python3 install pandas-bokeh

%pip install pandas_bokeh

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

import pandas_bokeh

basedf = pd.read_csv('./Marijuana_Data.csv')
basedf.head()



FileNotFoundError: ignored

# New Section

**Data Cleaning**

In [None]:
#removing space from column names
basedf = basedf.rename(columns={"Report ID": "ReportID", "Sex Code": "SexCode",
                                 "Arrest Date": "ArrestDate", "Area ID": "AreaID", "Reporting District": "ReportingDistrict", "Descent Code": "DescentCode", "Charge Description": "ChargeDescription"})
basedf.head()

**Gender and Age Distribution in Each Area**

In [None]:
newtable = pd.crosstab(basedf['AreaID'], basedf['SexCode'])
newtable


In [None]:
newtable.plot.bar(stacked=True)

With this visualization, we notice that the large majority of arrestees are male across all 21 police stations. On its own, however, this doesn't tell us much about the arrests. Based on a quick Google search, the ratio of men to women is about 97 men to 100 women, so the disproportionate number of men arrested isn't explained by a larger male population in LA. Do men tend to possess more marijuana than women? Is it easier for women to avoid arrest than men? Without other information about these specific arrests (which is difficult to find due to privacy laws), we cannot draw a firm conclusion about the gender ratio in marijuana arrests.
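Raw counts make it hard to compare stations of different sizes. One option, sketched here on made-up data, is to normalize each row of the crosstab so that every area shows the share of male and female arrestees. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the AreaID and SexCode columns.
toy = pd.DataFrame({
    "AreaID":  [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "SexCode": ["M", "M", "F", "M", "M", "M", "F", "M", "M"],
})

# normalize="index" divides each row by its total, so every row sums to 1.
share = pd.crosstab(toy["AreaID"], toy["SexCode"], normalize="index")
print(share.round(2))
```

Plotting `share` as a stacked bar chart would then show proportions rather than counts.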

Now, we can look at how age relates to the arrest data. The data includes people as young as 11 and as old as 79. Since there are so many distinct ages, we decided to group them using an AgeCode instead.
Anyone younger than 18 has a code of 1, 18-30 is 2, 31-40 is 3, 41-50 is 4, 51-60 is 5, 61-70 is 6, and anyone older is 7.

In [None]:
# Bin ages into the AgeCode groups:
# 1: <18, 2: 18-30, 3: 31-40, 4: 41-50, 5: 51-60, 6: 61-70, 7: 71+
# np.digitize counts how many lower bounds each age has reached, so one
# vectorized call replaces the row-by-row loop and avoids chained assignment.
age_lower_bounds = [18, 31, 41, 51, 61, 71]
basedf['AgeCode'] = np.digitize(basedf['Age'], age_lower_bounds) + 1

newtable = pd.crosstab(basedf['AreaID'], basedf['AgeCode'])
newtable.plot.bar(stacked=True)

As we can see in the visualization above, age code 2 (ages 18-30) appears to be the most common group among arrestees at all 21 police stations, followed by age code 3 (ages 31-40). This is plausible if the people who use marijuana products (and thereby get arrested for them) are mostly in these age groups. All other age codes make up a small share, with codes 6 and 7 the least common; people aged 61 and over are rarely arrested on marijuana charges in this data. We also noticed how many arrestees were younger than 18. Although we knew there were arrestees as young as 11 and 12, we didn't expect such a large number of minors to be arrested on marijuana charges.

Next, we will look at the relationship between area and descent code for the arrestees.

**Descent Distribution**

In [None]:
# pandas_bokeh.output_notebook()
# RaceTab = pd.crosstab(basedf['DescentCode'], basedf['AreaID'])
# RaceTab.plot.bar(stacked=True).legend(loc= 'best')
#max_elements.plot.bar()

newtable = pd.crosstab(basedf['AreaID'], basedf['DescentCode'])
newtable

This table shows the descent code counts for each area. The descent categorization in the data is awkward to work with: several codes are very specific, while others are broad.


The categories are as follows:

- A - Other Asian
- B - Black
- C - Chinese
- D - Cambodian
- F - Filipino
- G - Guamanian
- H - Hispanic/Latin/Mexican
- I - American Indian/Alaskan Native
- J - Japanese
- K - Korean
- L - Laotian
- O - Other
- P - Pacific Islander
- S - Samoan
- U - Hawaiian
- V - Vietnamese
- W - White
- X - Unknown
- Z - Asian Indian

So, we decided to group the Asian descent codes together under A - Asian. We also grouped Samoan (S) and Hawaiian (U) with Pacific Islander (P). Our new grouping is:

A - Asian, B - Black, G - Guamanian, H - Hispanic/Latin/Mexican, I - American Indian/Alaskan Native, O - Other, P - Pacific Islander, W - White, X - Unknown 

In [None]:
# use pie chart, easier to explain why we're disregarding the other races (so small that it won't matter)
# Map the fine-grained descent codes onto the grouped codes;
# any code not listed here keeps its original value.
asian_codes = ['C', 'D', 'F', 'J', 'K', 'L', 'V', 'Z']
pacific_codes = ['S', 'U']
race_map = {code: 'A' for code in asian_codes}
race_map.update({code: 'P' for code in pacific_codes})
basedf['RaceCode'] = basedf['DescentCode'].replace(race_map)

# One offset per slice, largest group first; this assumes exactly nine RaceCode groups are present.
explode = [0, 0, 0, 0, 0.2, 0.3, 0.4, 0.5, 0.6]
basedf['RaceCode'].value_counts().plot.pie(explode = explode)

This pie chart shows the share of each race among people arrested for marijuana-related crimes. As shown in the chart, Hispanic/Latin/Mexican (H) and Black (B) arrests make up a large portion of the pie, well over 2/3. White (W) and Other (O) arrests also make up a sizable portion. Since many of the individual Asian descent codes had very few arrestees, we grouped them all under "A," and the slice was still nearly impossible to see because the percentage is so low. We did the same for Pacific Islanders, grouping in Samoan and Hawaiian, and once again the slice was nearly impossible to see. This chart doesn't tell us much on its own, but it gives a sense of which trends we can look at.
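Because the smallest slices are hard to read off a pie chart, it may help to print the shares directly. The sketch below does this on a made-up `race` series, which is an assumption standing in for the RaceCode column.

```python
import pandas as pd

# Hypothetical toy series standing in for the RaceCode column.
race = pd.Series(["H", "H", "H", "B", "B", "W", "O", "A", "H", "B"], name="RaceCode")

# normalize=True returns fractions; multiply by 100 to get percentages.
shares = race.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

A table like this makes small groups visible even when their pie slices are too thin to see.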

**KNN: Predicting Charge Based on Area**

With this dataset, our primary goal was to see whether certain areas are more likely to see a specific marijuana-related charge. Although all the detainees were arrested for marijuana-related crimes, not all of the charges are the same.

In [None]:
basedf['ChargeDescription'].head(10)

The first 10 charge descriptions show that the charges vary from person to person. This lets us use KNN to try to classify which charge an arrestee would most likely face, based on which of the 21 Community Police Station areas the arrest occurred in.

First, we want to do some data cleaning. A couple of rows have no charge description. Since missing values cannot be used (as they'll affect the outcome), we decided to replace them with "Marijuana Related Crimes".

In [None]:
# Fill missing charge descriptions with a generic placeholder label.
basedf['ChargeDescription'] = basedf['ChargeDescription'].fillna("Marijuana Related Crimes")

print(basedf['ChargeDescription'])

Now, we want to convert these descriptions to numeric codes so that we can perform KNN.

In [None]:
# Map each known charge description to an integer code; unlisted charges get -1.
charge_codes = {
    "POSSESS MARIJUANA FOR SALE": 1,
    "SALE/OFFER TO SELL/TRANSPORT MARIJUANA": 2,
    "TRANSPORT/SELL/FURNISH/ETC MARIJUANA": 3,
    "SMOKE/INGEST MARIJUANA IN PUBLIC PLACE": 4,
    # FINISH THE REST
}
basedf['ChargeCode'] = basedf['ChargeDescription'].map(charge_codes).fillna(-1).astype(int)
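Enumerating every charge description by hand is tedious. As one possible alternative, `pd.factorize` assigns an integer to each distinct value in order of first appearance. The `charges` series below is a hypothetical sample that reuses a few descriptions from the data, and the resulting codes are not the same numbering as the hand-written mapping.

```python
import pandas as pd

# Hypothetical sample reusing charge descriptions that appear in the dataset.
charges = pd.Series([
    "POSSESS MARIJUANA FOR SALE",
    "SMOKE/INGEST MARIJUANA IN PUBLIC PLACE",
    "POSSESS MARIJUANA FOR SALE",
    "TRANSPORT/SELL/FURNISH/ETC MARIJUANA",
    "SMOKE/INGEST MARIJUANA IN PUBLIC PLACE",
])

# codes holds one integer per row; labels holds the distinct descriptions,
# so labels[codes[i]] recovers the original text for row i.
codes, labels = pd.factorize(charges)
print(codes)
print(list(labels))
```

Keeping `labels` around makes it possible to translate predicted codes back into readable charge descriptions later.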

In [None]:
basedf.head(20)
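The notebook stops before the KNN step itself. The sketch below shows how that step might look, using randomly generated data in place of the real arrests. The `toy` frame, the choice of `n_neighbors=5`, the 25% test split, and the one-hot encoding of `AreaID` are all assumptions for illustration, not the authors' method, and the scores it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical random data with the same column names as the notebook.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "AreaID": rng.integers(1, 22, size=n),      # areas 1..21
    "ChargeCode": rng.integers(1, 5, size=n),   # charge codes 1..4
})

# AreaID is a label, not a quantity, so one-hot encode it before
# computing distances; otherwise area 20 would look "close" to area 21.
X = pd.get_dummies(toy[["AreaID"]], columns=["AreaID"], dtype=int)
y = toy["ChargeCode"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

acc = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
cm = confusion_matrix(y_test, y_pred)
print("accuracy:", round(acc, 3))
print("macro F1:", round(macro_f1, 3))
print(cm)
```

With area as the only feature, every arrest from the same station sits at the same point, so KNN effectively votes among charges seen at that station. Adding features such as AgeCode would likely call for scaling them first, for example with `StandardScaler`.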