# Cleaning and Preprocessing Crime Data
The provided data is a CSV file with columns such as `Category`, `Latitude`, `Longitude`, `Location Name`, and `Date`. Before training a model to report crime statistics and identify crime-prone areas, the data must be cleaned and preprocessed. This involves removing unnecessary columns, converting categorical variables (such as the type of crime) into numerical ones, and dealing with missing or invalid data.

To achieve this, we use Python and the pandas library to read in the CSV file and drop the `Location Name` and `Date` columns, which are not used as features here. We then encode the `Category` column using `LabelEncoder` from `sklearn.preprocessing` and split the data into training and testing sets using `train_test_split` from `sklearn.model_selection`, with 80% of the rows for training and 20% for testing. Finally, we set a fixed `random_state` so the split is reproducible. Note that the cell below does not itself remove missing or invalid rows.
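The missing-or-invalid-data step mentioned above could look roughly like the following. This is a minimal sketch, not part of the original pipeline: the sample rows, the `clean_coordinates` helper, and the choice to drop (rather than impute) bad rows are all assumptions made for illustration. Only the column names come from the CSV described above.

```python
import pandas as pd

# Hypothetical sample rows; only the column names follow the CSV described above.
raw = pd.DataFrame({
    "Category": ["Theft", "Assault", None, "Theft", "Burglary"],
    "Latitude": [-1.321039, -1.267184, -1.230830, 95.0, None],
    "Longitude": [36.945499, 36.767125, 36.921711, 36.853941, 36.787716],
})


def clean_coordinates(df):
    """Drop rows with missing fields or coordinates outside valid ranges."""
    # Remove rows where any of the required fields is missing.
    cleaned = df.dropna(subset=["Category", "Latitude", "Longitude"])
    # Keep only latitudes in [-90, 90] and longitudes in [-180, 180].
    valid = cleaned["Latitude"].between(-90, 90) & cleaned["Longitude"].between(-180, 180)
    return cleaned[valid].reset_index(drop=True)


cleaned = clean_coordinates(raw)
print(cleaned)
```

Dropping rows is the simplest option. If many rows are affected, imputing values or investigating the source of the bad records may be a better choice.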

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Read in the data
df = pd.read_csv('fake_crime_reports.csv')

# Drop columns that are not used as features
df = df.drop(columns=['Location Name', 'Date'])

# Encode categorical variables
le = LabelEncoder()
df['Category'] = le.fit_transform(df['Category'])

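# Split the data into training (80%) and testing (20%) sets
X = df.drop(columns=['Category'])  # features: Latitude, Longitude
y = df['Category']                 # target: encoded crime category
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preview the preprocessed data
print(df.head())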

   Category  Latitude  Longitude
0         3 -1.321039  36.945499
1         4 -1.267184  36.767125
2         3 -1.230830  36.921711
3         4 -1.322356  36.853941
4         1 -1.303350  36.787716
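The integer codes in the `Category` column are assigned by `LabelEncoder` in sorted order of the original labels, so it can help to keep the mapping and to confirm the split sizes. The sketch below uses a small hypothetical frame. The category names and coordinates are invented for illustration and are not taken from `fake_crime_reports.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: 10 rows with made-up category names and coordinates.
toy = pd.DataFrame({
    "Category": ["Theft", "Assault", "Burglary", "Theft", "Robbery"] * 2,
    "Latitude": [-1.32, -1.27, -1.23, -1.32, -1.30, -1.28, -1.25, -1.31, -1.29, -1.26],
    "Longitude": [36.95, 36.77, 36.92, 36.85, 36.79, 36.80, 36.90, 36.83, 36.88, 36.81],
})

le = LabelEncoder()
toy["Category"] = le.fit_transform(toy["Category"])

# Mapping from original label to integer code (classes_ is sorted).
label_map = {label: int(code) for label, code in zip(le.classes_, le.transform(le.classes_))}
print(label_map)

X = toy.drop(columns=["Category"])
y = toy["Category"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# With 10 rows and test_size=0.2, the split is 8 training rows and 2 test rows.
print(len(X_train), len(X_test))

# Recover the original labels for the held-out rows.
decoded_test_labels = le.inverse_transform(y_test)
print(decoded_test_labels)
```

Keeping the fitted `le` object around is what makes `inverse_transform` possible later, for example when reporting predictions in terms of crime types rather than integer codes.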
