# Introduction to Kaggle 

To show how to participate in a Kaggle competition, let's use the [taxi fare prediction competition](https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction) as our playground. The dataset can be found [here](https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/data).

In [None]:
import pandas as pd 

# Loading Data

In [None]:
taxi_train = pd.read_csv('../data/new-york-city-taxi-fare-prediction/train.csv')

In [None]:
taxi_train.columns.to_list()

In [None]:
taxi_test = pd.read_csv('../data/new-york-city-taxi-fare-prediction/test.csv')
taxi_test.columns.to_list()

`fare_amount` is the column to predict, which is why it is absent from the test dataset.
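A quick way to confirm which column is the target is to compare the two column lists directly. The sketch below uses two tiny hand-made frames standing in for `taxi_train` and `taxi_test` (the names `train_demo` and `test_demo`, the column subset, and the values are illustrative assumptions), so it runs without the competition files.

```python
import pandas as pd

# Tiny stand-ins for taxi_train / taxi_test; values are made up for illustration.
train_demo = pd.DataFrame({
    "key": ["a", "b"],
    "fare_amount": [4.5, 16.9],
    "passenger_count": [1, 2],
})
test_demo = pd.DataFrame({
    "key": ["c"],
    "passenger_count": [1],
})

# Columns present in the training frame but missing from the test frame.
target_columns = sorted(set(train_demo.columns) - set(test_demo.columns))
print(target_columns)
```

The same set difference can be applied to the real `taxi_train.columns` and `taxi_test.columns`.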

# Defining the problem

In [None]:
taxi_train.describe()

In [None]:
taxi_train.head()

In [None]:
import matplotlib.pyplot as plt 
import seaborn as sns 

#sns.histplot(taxi_train.fare_amount)

In [None]:
taxi_train.fare_amount.quantile(0.99)

In [None]:
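# Plot fares between 0 and 80 so extreme outliers do not flatten the histogram
taxi_train[(taxi_train.fare_amount > 0) & (taxi_train.fare_amount < 80)].fare_amount.hist(bins=30, alpha=0.5)

In [None]:
# Share of rows whose fare falls inside that (0, 80) range
taxi_train[(taxi_train.fare_amount > 0) & (taxi_train.fare_amount < 80)]['key'].count() / len(taxi_train)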

We are dealing with a **regression** problem here, since the target variable is continuous.


In [None]:
taxi_train.isna().sum()/len(taxi_train)

In [None]:
taxi_train = taxi_train.dropna()

# Building a model

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
lr = LinearRegression() 

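# Drop the target plus the identifier and raw timestamp columns, which are not numeric features
features = taxi_train.drop(['fare_amount', 'pickup_datetime', 'key'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(features, taxi_train.fare_amount, test_size=0.2)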

In [None]:
X_train.head()

In [None]:
lr.fit(X_train, y_train)

In [None]:
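import numpy as np
from sklearn.metrics import mean_squared_error

# The `squared` argument is deprecated in recent scikit-learn releases,
# so take the square root explicitly to get the RMSE.
RMSE_train = np.sqrt(mean_squared_error(y_train, lr.predict(X_train)))
print('RMSE_train:', RMSE_train)

RMSE_test = np.sqrt(mean_squared_error(y_test, lr.predict(X_test)))
print('RMSE_test:', RMSE_test)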
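An error value on its own is hard to judge, so it can help to compare it with a trivial baseline that always predicts the training mean. The sketch below is one possible way to do that, not part of the original workflow: it uses synthetic data (a single made-up feature with a linear relationship to the target) instead of the taxi data, and `DummyRegressor` is swapped in as the baseline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: the target grows linearly with one feature, plus noise.
X = rng.uniform(0, 10, size=(500, 1))
y = 2.5 + 2.0 * X[:, 0] + rng.normal(0, 1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)


def rmse(y_true, y_pred):
    """Root mean squared error, computed from scikit-learn's MSE."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

baseline_rmse = rmse(y_te, baseline.predict(X_te))
model_rmse = rmse(y_te, model.predict(X_te))
print("baseline RMSE:", baseline_rmse)
print("linear model RMSE:", model_rmse)
```

If a model does not clearly beat this baseline on held-out data, its features are probably carrying little signal.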


# Solution Workflow

Understand the problem > EDA > Local Validation > Modeling

## Understanding the problem 

### Data type

What kind of data are we dealing with? Tabular? Time Series? Images? Text? 

### Problem type 

Is it about classification? Regression? Ranking? 

### Evaluation metric 

ROC AUC, F1-Score, MAE, MSE... 

Most evaluation metrics are available in the `sklearn.metrics` module, but some competitions use a very specific metric that we have to implement ourselves.
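As an illustration of implementing a metric by hand, here is a minimal sketch of RMSE (the quantity computed in the modeling cells above) written with NumPy only. The function name `rmse` and the sample values are assumptions chosen for the example.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors are -2, 2 and -3, so the mean squared error is (4 + 4 + 9) / 3.
score = rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(score)
```

A custom competition metric can follow the same pattern: convert the inputs to arrays, apply the formula from the competition page, and return a single float.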

## Exploratory Data Analysis 

Exploratory Data Analysis (EDA) helps us:
- Size up the data
- Understand the properties of the target variable
- Understand the properties of the features
- Generate ideas for feature engineering
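The points above map onto a handful of pandas calls. The sketch below runs them on a hypothetical five-row miniature of the taxi data (the values are invented for illustration); on the real data the same calls would be applied to `taxi_train`.

```python
import pandas as pd

# Hypothetical miniature of the taxi data; values are invented.
df = pd.DataFrame({
    "fare_amount": [4.5, 16.9, 5.7, 7.7, 5.3],
    "passenger_count": [1, 1, 2, 1, 1],
    "pickup_longitude": [-73.84, -74.02, -73.98, -73.99, -73.97],
})

# Size of the data: rows, columns, and memory footprint in bytes.
n_rows, n_cols = df.shape
memory_bytes = int(df.memory_usage(deep=True).sum())

# Properties of the target variable.
target_summary = df["fare_amount"].describe()

# Properties of the features.
feature_summary = df.drop(columns="fare_amount").describe()

# A starting point for feature ideas: correlation of each feature with the target.
target_corr = df.corr()["fare_amount"].drop("fare_amount")

print(n_rows, n_cols, memory_bytes)
print(target_summary)
print(feature_summary)
print(target_corr)
```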






In [None]:
## Keep track of data splits, models, results...
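One lightweight way to keep track of splits, models, and results is to append one record per fit to a list and turn it into a DataFrame. The sketch below is an assumption about how this could look, not a prescribed setup: it uses synthetic data, and the choice of `KFold`, `Ridge`, and the record fields (`model`, `fold`, `seed`, `rmse`) is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold

SEED = 0
rng = np.random.default_rng(SEED)
# Synthetic regression data standing in for the competition data.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=200)

results = []  # one dict per (model, fold) pair
kf = KFold(n_splits=3, shuffle=True, random_state=SEED)
models = [("linear", LinearRegression()), ("ridge", Ridge(alpha=1.0))]

for name, model in models:
    for fold, (train_idx, valid_idx) in enumerate(kf.split(X)):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[valid_idx])
        fold_rmse = float(np.sqrt(np.mean((y[valid_idx] - pred) ** 2)))
        # Record everything needed to reproduce and compare this run.
        results.append({"model": name, "fold": fold, "seed": SEED, "rmse": fold_rmse})

log = pd.DataFrame(results)
summary = log.groupby("model")["rmse"].mean()
print(log)
print(summary)
```

Storing the seed and fold index alongside each score makes it possible to rerun exactly the same split later and to compare models on identical folds.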