![logo.png](logo.png)

 
# Data Science - Machine Learning Challenge


### Background

Assume you have been selected to help the Chicago Police Department build the machine learning services that will power its next generation of mobile crime analytics software. In particular, this software aims to predict, in real time, the category of a crime (for instance 'robbery', 'assault', or 'theft') as soon as it is reported by an emergency call. This prediction can only use information available at the time of the call (such as time and location), without any on-the-ground assessment or knowledge of ex post outcomes (such as arrest, conviction, or the demographics of the victim(s) or offender(s)).

### Data

The dataset at your disposal (`chicago_crimes_data_2010_2017.csv`) contains a random sample of all reported incidents of crime that occurred in the City of Chicago between 2010 and 2017. The columns in the CSV file include:

- **ID:** Unique identifier for the record

- **Case Number:** The Chicago Police Department RD Number (Records Division Number), which is unique to the incident

- **Block:** The partially redacted address where the incident occurred, placing it on the same block as the actual address

- **IUCR:** The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description. You can find the list of IUCR codes at <https://data.cityofchicago.org/d/c7ck-438e>

- **Primary Type:** The primary description of the IUCR code

- **Description:** The secondary description of the IUCR code, a subcategory of the primary description.

- **Location Description:** Description of the location where the incident occurred.

- **Arrest:** Indicates whether an arrest was made

- **Domestic:** Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act

- **Beat:** Indicates the beat where the incident occurred. A beat is the smallest police geographic area - each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts

- **District:** Indicates the police district where the incident occurred

- **Ward:** The ward (City Council district) where the incident occurred

- **Community Area:** Indicates the community area where the incident occurred. Chicago has 77 community areas

- **FBI Code:** Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS)

- **Latitude:** The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block

- **Longitude:** The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block

The dataset has already been split into training and test sets (cf. the `train_or_test` column).

## Your assignment

You have 5 hours to explore the data and tackle the problem using machine learning.

### Part A - Binary classification

Build a model that can predict whether or not the crime is a **'THEFT'** (identified in the **Primary Type** column) given a relevant set of features at your disposal. Please explain your choice of features in light of the use case highlighted above. Use the training data to train the model and discuss its performance on the test data.
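One way to reason about feature choice is to list which columns could plausibly be known when the call comes in. The sketch below is only an illustration of that reasoning, not a prescribed answer: the names `LEAKY_COLUMNS`, `IDENTIFIER_COLUMNS`, and `candidate_features` are hypothetical, and the grouping of columns is an assumption based on the descriptions in the Data section (for example, IUCR is described as directly linked to the Primary Type, and an arrest is an ex post outcome). Whether a column such as Domestic is really known at call time is a judgment call you should argue for yourself.

```python
# Columns as listed in the Data section.
ALL_COLUMNS = [
    "ID", "Case Number", "Block", "IUCR", "Primary Type", "Description",
    "Location Description", "Arrest", "Domestic", "Beat", "District",
    "Ward", "Community Area", "FBI Code", "Latitude", "Longitude",
]

TARGET = "Primary Type"

# Assumption: these columns either encode the crime category itself or an
# outcome known only after the call, so using them would leak the answer.
LEAKY_COLUMNS = {"IUCR", "Description", "FBI Code", "Arrest"}

# Assumption: record identifiers are bookkeeping values, not predictive
# information available to a dispatcher.
IDENTIFIER_COLUMNS = {"ID", "Case Number"}


def candidate_features(columns):
    """Return the columns left after dropping the target, leaks, and IDs."""
    excluded = LEAKY_COLUMNS | IDENTIFIER_COLUMNS | {TARGET}
    return [c for c in columns if c not in excluded]


features = candidate_features(ALL_COLUMNS)
print(features)
```

The remaining columns are all location-related or call-context fields, which matches the "information available at the time of the call" constraint described in the Background.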

**Please answer the following questions:**

1. What is the accuracy of a naïve model that would always guess 'THEFT' and what is the accuracy of your model?

2. Are there any other metrics that you have computed to assess the performance of your model? If yes, discuss their values.

3. What approach did you use and why?

4. How would you improve your model if you had another hour / another week at your disposal?
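As a rough illustration of questions 1 and 2, the naive baseline and a few complementary metrics can be computed with plain Python. This is a minimal sketch under stated assumptions: the helper names `accuracy` and `precision_recall_f1` are hypothetical, and the labels in `y_test` are made up for the example rather than taken from the dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred, positive="THEFT"):
    """Precision, recall, and F1 score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels for illustration only; not taken from the dataset.
y_test = ["THEFT", "ASSAULT", "THEFT", "ROBBERY", "ASSAULT"]

# The naive model always guesses 'THEFT'.
naive_pred = ["THEFT"] * len(y_test)

baseline_acc = accuracy(y_test, naive_pred)
baseline_precision, baseline_recall, baseline_f1 = precision_recall_f1(
    y_test, naive_pred
)
print(baseline_acc, baseline_precision, baseline_recall, baseline_f1)
```

The naive model's accuracy equals the share of 'THEFT' records in the test set, and its recall is always 1.0 because it never misses a theft. Comparing your model against both numbers shows whether it adds anything beyond the class frequency.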

Remark: If you did any data exploration or pre-processing, please explain the rationale behind it and report your observations.

### Part B (Optional) - Multiclass classification

Build a model that can predict the **Primary Type** of the crime given a relevant set of features at your disposal, and answer questions 2, 3, and 4 above.
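For the multiclass case, accuracy alone can hide poor performance on less frequent crime types. One possible complementary metric is the macro-averaged F1 score, sketched below in plain Python. The function name `macro_f1` is hypothetical and the labels are made up for the example; this is one option among several, not a required metric.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class
        # is neither present nor predicted.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration only; not taken from the dataset.
y_true = ["THEFT", "THEFT", "THEFT", "ASSAULT", "ROBBERY"]
y_pred = ["THEFT", "THEFT", "THEFT", "THEFT", "THEFT"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(plain_accuracy, score)
```

In this toy example a model that always predicts 'THEFT' reaches 0.6 accuracy but only 0.25 macro F1, because the two classes it never predicts each contribute a score of zero.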

### Output

Your returned assignment should include:

- Working code which solves the problem

- A document with your answers to the questions

- These two can be combined, for instance in a Jupyter notebook or an R Markdown document

### Evaluation

You will be evaluated on (in order of decreasing importance):

- Returning the assignment on time, even if you're not done

- The accuracy of your predictions on the test set

- The technical soundness and thoroughness of your work

- How well you describe your methods and future work

### Final remarks

- You are free to use whatever tools you are most comfortable with to work through the analysis
 

In [None]:
!git clone --branch sonder_1 https://github.com/interviewquery/takehomes.git
%cd takehomes/sonder_1
!if [[ $(ls *.zip) ]]; then unzip *.zip; fi
!ls

In [None]:
# Write your code here