# Import Libraries

In [23]:
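import pandas as pd
import numpy as np
import sklearn
import os
# `kaggle` is imported further down, after the credentials file has been written:
# importing it tries to authenticate straight away and fails if kaggle.json is missing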

## Using Kaggle data on your own machine

Kaggle limits your weekly time using a GPU machine. The limits are very generous, but you may well still find it's not enough! In that case, you'll want to use your own GPU server, or a cloud server such as Colab, Paperspace Gradient, or SageMaker Studio Lab (all of which have free options). To do so, you'll need to be able to download Kaggle datasets.

The easiest way to download Kaggle datasets is to use the Kaggle API. You can install this using `pip` by running this in a notebook cell:

    !pip install kaggle

You need an API key to use the Kaggle API; to get one, click on your profile picture on the Kaggle website, choose My Account, then click Create New API Token. This will save a file called *kaggle.json* to your PC. You need to copy this key to your GPU server. To do so, open the file you downloaded, copy the contents, and paste them into the following cell as the value of `creds` (e.g., `creds = '{"username":"xxx","key":"xxx"}'`):

In [24]:

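# paste the contents of kaggle.json between the quotes
creds = ''

# non-empty only when the notebook runs inside a Kaggle kernel
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')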

Then execute this cell (this only needs to be run once):

In [26]:
# for working with paths in Python, I recommend using `pathlib.Path`
from pathlib import Path

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
# write the file only if it is missing and credentials were actually pasted in
if not cred_path.exists() and creds:
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
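
Before writing the file, it can help to confirm that the pasted string really is the JSON that Kaggle generated. The snippet below is a minimal sketch of such a check; the helper name `check_creds` is hypothetical, and the only assumption about the file format is the two keys shown in the example above (`username` and `key`).

```python
import json

def check_creds(creds):
    """Hypothetical helper: parse the pasted JSON and confirm both expected keys are present."""
    data = json.loads(creds)  # raises json.JSONDecodeError if the string is not valid JSON
    missing = {'username', 'key'} - data.keys()
    if missing:
        raise ValueError(f'credentials are missing: {sorted(missing)}')
    return data

# Placeholder values, as in the example above
parsed = check_creds('{"username":"xxx","key":"xxx"}')
print(sorted(parsed))
```

Calling it right after pasting the credentials surfaces a copy-and-paste mistake immediately, rather than at download time.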

Now you can download datasets from Kaggle.

In [27]:
path = Path('playground-series-s4e5')

And use the Kaggle API to download the dataset to that path, and extract it:

In [29]:
if not iskaggle and not path.exists():
    import zipfile,kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

Note that you can easily download notebooks from Kaggle and upload them to other cloud services. So if you're low on Kaggle GPU credits, give this a try!

## Import and EDA

In [31]:
if iskaggle:
    path = Path('../input/playground-series-s4e5')
    !pip install -q datasets

Documents in NLP datasets are generally in one of two main forms:

- **Larger documents**: One text file per document, often organised into one folder per category
- **Smaller documents**: One document (or document pair, optionally with metadata) per row in a [CSV file](https://realpython.com/python-csv/).

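Let's look at our data and see what we've got. In Jupyter you can run any shell command by starting a line with a `!`, and use `{}` to include Python variables. Note that `ls` only exists in Unix-like shells; in the Windows command prompt it fails, as the output below shows, and `!dir {path}` is the equivalent: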

In [32]:
!ls {path}

'ls' is not recognized as an internal or external command,
operable program or batch file.
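
The `ls` error above comes from running the cell under the Windows command prompt. A portable alternative is to list the folder from Python itself. This is a minimal sketch using `pathlib`; the helper name `list_files` is hypothetical and not part of the notebook.

```python
from pathlib import Path

def list_files(folder):
    """Return (name, size_in_bytes) for each file directly inside `folder`, sorted by name."""
    folder = Path(folder)
    return sorted((p.name, p.stat().st_size) for p in folder.iterdir() if p.is_file())

# Example: list the current working directory; in the notebook you would pass `path`
for name, size in list_files('.'):
    print(f'{size:>12,}  {name}')
```

Because it only relies on the standard library, the same cell behaves identically on Windows, Linux, and Kaggle kernels.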


It looks like this competition uses CSV files. For opening, manipulating, and viewing CSV files, it's generally best to use the Pandas library, which is explained brilliantly in [this book](https://wesmckinney.com/book/) by the lead developer (it's also an excellent introduction to matplotlib and numpy, both of which I use in this notebook). Generally it's imported as the abbreviation `pd`.

Let's load the training and test sets:

In [54]:
train = pd.read_csv(path/'train.csv')
test = pd.read_csv(path/'test.csv')

This creates a [DataFrame](https://pandas.pydata.org/docs/user_guide/10min.html), which is a table of named columns, a bit like a database table. To view the first and last rows, and row count of a DataFrame, just type its name:

In [49]:
train

Unnamed: 0,id,MonsoonIntensity,TopographyDrainage,RiverManagement,Deforestation,Urbanization,ClimateChange,DamsQuality,Siltation,AgriculturalPractices,...,DrainageSystems,CoastalVulnerability,Landslides,Watersheds,DeterioratingInfrastructure,PopulationScore,WetlandLoss,InadequatePlanning,PoliticalFactors,FloodProbability
0,0,5,8,5,8,6,4,4,3,3,...,5,3,3,5,4,7,5,7,3,0.45
1,1,6,7,4,4,8,8,3,5,4,...,7,2,0,3,5,3,3,4,3,0.45
2,2,6,5,6,7,3,7,1,5,4,...,7,3,7,5,6,8,2,3,3,0.53
3,3,3,4,6,5,4,8,4,7,6,...,2,4,7,4,4,6,5,7,5,0.54
4,4,5,3,2,6,4,4,3,3,3,...,2,2,6,6,4,1,2,3,5,0.41
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1117952,1117952,3,3,4,10,4,5,5,7,10,...,7,8,7,2,2,1,4,6,4,0.49
1117953,1117953,2,2,4,3,9,5,8,1,3,...,9,4,4,3,7,4,9,4,5,0.48
1117954,1117954,7,3,9,4,6,5,9,1,3,...,5,5,5,5,6,5,5,2,4,0.48
1117955,1117955,7,3,3,7,5,2,3,4,6,...,6,8,5,3,4,6,7,6,4,0.49
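
Besides typing the DataFrame's name, a few other calls give a quick overview. The sketch below uses a tiny stand-in frame built from the first four rows and three of the columns shown above, so it runs without the competition files; the variable names `df` and `missing` are illustrative.

```python
import pandas as pd

# Tiny stand-in frame: first four rows of three columns from the output above
df = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'MonsoonIntensity': [5, 6, 6, 3],
    'FloodProbability': [0.45, 0.45, 0.53, 0.54],
})

print(df.shape)    # (rows, columns)
print(df.head(2))  # first two rows
print(df.dtypes)   # data type of each column

# Number of missing values per column
missing = df.isna().sum()
print(missing)
```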


It's important to carefully read the [dataset description](https://www.kaggle.com/competitions/playground-series-s4e5/data) to understand how each of these columns is used.

One of the most useful features of `DataFrame` is the `describe()` method:

In [50]:
train.describe(include='all')

Unnamed: 0,id,MonsoonIntensity,TopographyDrainage,RiverManagement,Deforestation,Urbanization,ClimateChange,DamsQuality,Siltation,AgriculturalPractices,...,DrainageSystems,CoastalVulnerability,Landslides,Watersheds,DeterioratingInfrastructure,PopulationScore,WetlandLoss,InadequatePlanning,PoliticalFactors,FloodProbability
count,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,...,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0
mean,558978.0,4.92,4.93,4.96,4.94,4.94,4.93,4.96,4.93,4.94,...,4.95,4.95,4.93,4.93,4.93,4.93,4.95,4.94,4.94,0.5
std,322726.53,2.06,2.09,2.07,2.05,2.08,2.06,2.08,2.07,2.07,...,2.07,2.09,2.08,2.08,2.06,2.07,2.07,2.08,2.09,0.05
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.28
25%,279489.0,3.0,3.0,4.0,4.0,3.0,3.0,4.0,3.0,3.0,...,4.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,3.0,0.47
50%,558978.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,0.51
75%,838467.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,...,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,0.54
max,1117956.0,16.0,18.0,16.0,17.0,17.0,17.0,16.0,16.0,16.0,...,17.0,17.0,16.0,16.0,17.0,18.0,19.0,16.0,16.0,0.72


In [59]:
test.describe(include='all')

Unnamed: 0,id,MonsoonIntensity,TopographyDrainage,RiverManagement,Deforestation,Urbanization,ClimateChange,DamsQuality,Siltation,AgriculturalPractices,...,IneffectiveDisasterPreparedness,DrainageSystems,CoastalVulnerability,Landslides,Watersheds,DeterioratingInfrastructure,PopulationScore,WetlandLoss,InadequatePlanning,PoliticalFactors
count,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,...,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0
mean,1490609.0,4.92,4.93,4.96,4.95,4.94,4.93,4.96,4.93,4.95,...,4.95,4.94,4.96,4.93,4.93,4.93,4.93,4.95,4.94,4.94
std,215151.17,2.06,2.09,2.07,2.05,2.08,2.06,2.09,2.07,2.07,...,2.08,2.07,2.09,2.08,2.08,2.07,2.07,2.07,2.08,2.09
min,1117957.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1304283.0,3.0,3.0,4.0,4.0,3.0,3.0,4.0,3.0,3.0,...,3.0,4.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,3.0
50%,1490609.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
75%,1676935.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,...,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0
max,1863261.0,16.0,17.0,16.0,17.0,17.0,17.0,16.0,16.0,16.0,...,16.0,17.0,17.0,16.0,16.0,17.0,19.0,22.0,16.0,16.0
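
The two `describe()` tables above look very similar, which suggests the train and test features follow similar distributions. One way to check this without comparing numbers by eye is to subtract the summaries. The sketch below is illustrative only: the helper `summary_gap` is hypothetical, and the two frames are synthetic stand-ins, not the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = ['MonsoonIntensity', 'Deforestation']

# Synthetic stand-ins for the train and test frames (NOT the competition data)
train_demo = pd.DataFrame(rng.poisson(5, size=(1000, 2)), columns=cols)
test_demo = pd.DataFrame(rng.poisson(5, size=(800, 2)), columns=cols)

def summary_gap(a, b, stats=('mean', 'std')):
    """Absolute difference between the chosen describe() rows of two frames, over shared columns."""
    shared = a.columns.intersection(b.columns)
    rows = list(stats)
    return (a[shared].describe().loc[rows] - b[shared].describe().loc[rows]).abs()

gap = summary_gap(train_demo, test_demo)
print(gap.round(3))
```

With the real data you would call `summary_gap(train, test)`; the column intersection drops `FloodProbability` automatically because it only exists in `train`.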


## Explanation of columns
- Target value: FloodProbability
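
Every column other than the target is a candidate feature. The `id` column appears to be a plain row counter (it runs from 0 to 1117956 in `train`), so one option is to leave it out of the features. The sketch below illustrates that choice on a three-row stand-in frame taken from the output above; it is an optional variation, not what the notebook's later cells do.

```python
import pandas as pd

# Three rows and four columns copied from the train output above
demo = pd.DataFrame({
    'id': [0, 1, 2],
    'MonsoonIntensity': [5, 6, 6],
    'TopographyDrainage': [8, 7, 5],
    'FloodProbability': [0.45, 0.45, 0.53],
})

target = 'FloodProbability'
# Keep everything except the row identifier and the target
feature_cols = [c for c in demo.columns if c not in ('id', target)]

X_demo, y_demo = demo[feature_cols], demo[target]
print(feature_cols)
```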

In [55]:
target_column = train['FloodProbability']
#train['id'] = train['id'] + 1

In [63]:
from sklearn.model_selection import train_test_split

selected_columns = train.columns.drop(['FloodProbability'])
X = train[selected_columns]
y = train['FloodProbability']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [64]:
print('Shape of train:', X_train.shape, y_train.shape)
print('Shape of test:', X_test.shape, y_test.shape)

Shape of train: (894365, 21) (894365,)
Shape of test: (223592, 21) (223592,)
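
The shapes above follow directly from `test_size=0.2` applied to the 1,117,957 training rows. As a quick sanity check, the arithmetic can be reproduced by hand; this sketch assumes that scikit-learn rounds the test share up and gives the remainder to the training split.

```python
import math

n_samples = 1_117_957  # row count of train, from the describe() output
test_size = 0.2

# Assumption: the test share is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 894365 223592
```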


In [71]:
from lazypredict.Supervised import LazyRegressor  # the class lives in the Supervised submodule

# Reuse the flood-data split from above. The Boston housing data used here
# originally is no longer available: load_boston was removed in scikit-learn 1.2.
# LazyRegressor fits dozens of models, so work on random subsamples;
# the sample sizes below are arbitrary choices, adjust them as needed.
X_small = X_train.sample(n=20_000, random_state=13)
y_small = y_train.loc[X_small.index]
X_eval = X_test.sample(n=5_000, random_state=13)
y_eval = y_test.loc[X_eval.index]

reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, predictions = reg.fit(X_small, X_eval, y_small, y_eval)

print(models)

In [58]:
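from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# y_pred must contain predictions for the hold-out split X_test (223,592 rows).
# Scoring predictions made on the Kaggle test set (745,305 rows, no labels)
# against y_test raises "inconsistent numbers of samples".
# The notebook does not show which model produced y_pred, so a plain
# LinearRegression is used here as a stand-in.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate mean squared error
mse = mean_squared_error(y_test, y_pred)

# Calculate root mean squared error
rmse = np.sqrt(mse)

# Calculate mean absolute error
mae = mean_absolute_error(y_test, y_pred)

# Calculate R-squared
r2 = r2_score(y_test, y_pred)

mse, rmse, mae, r2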
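
The four metrics in the cell above can also be computed by hand, which makes it clear why both arrays need exactly one entry per sample. The sketch below is a plain NumPy version; the helper name `regression_metrics` and the predicted values are hypothetical, while the true values are the first four `FloodProbability` entries shown earlier.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R-squared for two equally long 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError(f'shape mismatch: {y_true.shape} vs {y_pred.shape}')
    err = y_true - y_pred
    mse = float(np.mean(err ** 2))
    mae = float(np.mean(np.abs(err)))
    ss_res = float(np.sum(err ** 2))                         # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))    # total sum of squares
    return {'mse': mse, 'rmse': mse ** 0.5, 'mae': mae, 'r2': 1.0 - ss_res / ss_tot}

# True values from the first rows of train; predictions are made up for illustration
m = regression_metrics([0.45, 0.45, 0.53, 0.54], [0.46, 0.44, 0.52, 0.55])
print(m)
```

Passing arrays of different lengths raises a `ValueError` here as well, so the predictions always have to come from the same rows as the labels they are scored against.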