# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* The file inputs/datasets/raw/Airplane_Complete_Imputation.csv, downloaded from Kaggle to a local computer and then uploaded to the Gitpod workspace.

## Outputs

* Generate Dataset: outputs/datasets/collection/airplane_performance_study.csv

## Additional Comments

Downloading the data directly from Kaggle into the notebook with an authentication token (Kaggle JSON file) failed with a "403 - Forbidden" error, so the file was downloaded manually and uploaded to the workspace instead.


---

# Install Python packages in the notebook

In [1]:
%pip install -r /workspace/churnometer_forked//requirements.txt

ERROR: Could not open requirements file: [Errno 2] No such file or directory: '/workspace/churnometer_forked//requirements.txt'
Note: you may need to restart the kernel to use updated packages.
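The error above says that no requirements file exists at the hard-coded path. One way to avoid hard-coding the path is to search for the file. The following is a minimal sketch, assuming the file is named `requirements.txt` and sits in the notebook's folder or one of its parent folders. `find_requirements` is a hypothetical helper and is not part of the original notebook.

```python
from pathlib import Path
from typing import Optional


def find_requirements(start: Path, filename: str = "requirements.txt") -> Optional[Path]:
    """Return the first `filename` found in `start` or any of its parents.

    Hypothetical helper for illustration; returns None if nothing is found.
    """
    for folder in [start, *start.parents]:
        candidate = folder / filename
        if candidate.is_file():
            return candidate
    return None


requirements_path = find_requirements(Path.cwd())
if requirements_path is None:
    print("No requirements.txt found; check the repository layout.")
else:
    # In a notebook cell, the resolved path could then be passed to pip:
    # %pip install -r {requirements_path}
    print(f"Found requirements file: {requirements_path}")
```

Searching upward from the current folder keeps the cell working whether it runs from `jupyter_notebooks` or from the project root.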


# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/data-driven-design/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/data-driven-design'
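Re-running the directory-change cell moves the working directory up one more level each time. The following is a minimal sketch of a guarded version, assuming the notebooks live in a folder named `jupyter_notebooks` as shown in the output of `os.getcwd()` above. `move_to_project_root` is a hypothetical helper name.

```python
import os


def move_to_project_root(notebook_folder: str = "jupyter_notebooks") -> str:
    """Move up one level only if the current folder is the notebook folder.

    Hypothetical helper: calling it a second time leaves the directory unchanged.
    """
    current_dir = os.getcwd()
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()


print(move_to_project_root())
```

Because the check depends on the folder name, the cell can be executed repeatedly without leaving the project root.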

# Fetch data from Kaggle

Download the data file from Kaggle to a local computer, then upload it to the Gitpod workspace through the Explorer panel:

Step 1: Go to [Aircraft Performance (Aircraft Bluebook)](https://www.kaggle.com/datasets/heitornunes/aircraft-performance-dataset-aircraft-bluebook?select=Airplane_Complete_Imputation.csv)

Step 2: Scroll down and select the file "Airplane Complete Imputation.csv"

<img src="/workspace/data-driven-design/images_notebook/choose_file_from_kaggle.jpg" alt="Screenshot showing which file on Kaggle to download" width="700" />

Step 3: Download the data file from Kaggle to your local computer: [Aircraft Performance (Aircraft Bluebook)](https://www.kaggle.com/datasets/heitornunes/aircraft-performance-dataset-aircraft-bluebook?select=Airplane_Complete_Imputation.csv)

<img src="/workspace/data-driven-design/images_notebook/download_from_kaggle.jpg" alt="Screenshot showing where to download the data on Kaggle" width="700" />
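Since the upload is a manual step, it can help to confirm that the file arrived before trying to load it. The following is a minimal sketch, assuming the upload target is the path listed under Inputs and that the working directory is the project root. `raw_file_is_ready` is a hypothetical helper name.

```python
from pathlib import Path

# Path taken from the Inputs section of this notebook.
RAW_FILE = Path("inputs/datasets/raw/Airplane_Complete_Imputation.csv")


def raw_file_is_ready(path: Path) -> bool:
    """Return True if `path` exists, is a regular file, and is not empty."""
    return path.is_file() and path.stat().st_size > 0


if raw_file_is_ready(RAW_FILE):
    print(f"Found {RAW_FILE}")
else:
    print(f"{RAW_FILE} is missing or empty; upload it before loading the data.")
```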

---

# Load and Inspect Kaggle data

In [9]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head()

|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


DataFrame Summary

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


We check whether any `customerID` is duplicated. There are none: the filter below returns an empty DataFrame.

In [11]:
df[df.duplicated(subset=['customerID'])]

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
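An empty result is easy to misread, so counting the duplicates gives a more explicit answer. The following is a minimal sketch on toy data (the IDs and values are invented for illustration and do not come from the dataset).

```python
import pandas as pd

# Toy data, not taken from the dataset: the ID "A2" appears twice.
toy = pd.DataFrame(
    {"customerID": ["A1", "A2", "A2", "A3"], "tenure": [1, 5, 5, 9]}
)

# duplicated() marks every repeat after the first occurrence as True.
duplicate_mask = toy.duplicated(subset=["customerID"])
n_duplicates = int(duplicate_mask.sum())

print(f"Duplicated IDs: {n_duplicates}")
print(toy[duplicate_mask])
```

By default `duplicated()` keeps the first occurrence unmarked, so the count reflects the number of extra rows rather than the number of rows involved.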


Converting `TotalCharges` to numeric

In [12]:
# errors='coerce' turns values that cannot be parsed as numbers into NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

Check `TotalCharges` data type

In [13]:
df['TotalCharges'].dtype

dtype('float64')
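With `errors='coerce'`, any value that cannot be parsed becomes `NaN` silently, so it is worth counting how many values were affected. The following is a minimal sketch on a toy series (the values are invented for illustration), not a statement about how many such values this dataset contains.

```python
import pandas as pd

# Toy values: the middle entry is a blank string that cannot be parsed.
raw = pd.Series(["29.85", " ", "108.15"])

converted = pd.to_numeric(raw, errors="coerce")

# Count the entries that became NaN during conversion.
n_missing = int(converted.isna().sum())
print(converted.dtype, n_missing)
```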

We noticed that `Churn` is a categorical variable with the values Yes and No. We convert it to an integer because the ML model requires numeric variables.

In [14]:
df['Churn'].unique()

array(['No', 'Yes'], dtype=object)

In [15]:
df['Churn'] = df['Churn'].replace({"Yes":1, "No":0})

Check the `Churn` data type.

In [16]:
df['Churn'].dtype

dtype('int64')
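An alternative to `replace` is an explicit mapping that fails loudly when an unexpected label appears. The following is a minimal sketch on a toy series, assuming the only valid labels are "Yes" and "No" as shown by `unique()` above; this is a swapped-in technique, not the method used in the cell above.

```python
import pandas as pd

# Toy labels for illustration.
churn = pd.Series(["No", "Yes", "No"])
mapping = {"Yes": 1, "No": 0}

# Fail early if the column holds a label the mapping does not cover.
unexpected = set(churn.unique()) - set(mapping)
if unexpected:
    raise ValueError(f"Unexpected labels: {sorted(unexpected)}")

encoded = churn.map(mapping).astype("int64")
print(encoded.tolist())
```

The explicit check matters because `map` would otherwise turn unknown labels into `NaN` without raising an error.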

# Push files to Repo

In [17]:
import os
try:
    os.makedirs(name='outputs/datasets/collection')  # create the outputs/datasets/collection folder
except FileExistsError as e:
    print(e)  # the folder already exists, so there is nothing to create

df.to_csv("outputs/datasets/collection/airplane_performance_study.csv", index=False)

[Errno 17] File exists: 'outputs/datasets/collection'
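The "File exists" message above is harmless, but it can be avoided entirely. The following is a minimal sketch using the `exist_ok=True` argument of `os.makedirs`. `save_dataset` is a hypothetical helper name and is not part of the original notebook.

```python
import os

import pandas as pd


def save_dataset(df: pd.DataFrame, folder: str, filename: str) -> str:
    """Create `folder` if needed, write `df` as CSV, and return the file path."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)
    return path
```

In this notebook the call would be `save_dataset(df, "outputs/datasets/collection", "airplane_performance_study.csv")`, and it can be repeated safely because an existing folder is not treated as an error.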
