# **Data Collection Workbook**

<!-- ## Objectives

* Fetch data from Kaggle and save as raw data
* Merge datasets 
* Inspect the data and save it under outputs/datasets/collection

## Inputs

*   Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/wine_reviews_collected.csv

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 
 -->


---

# Change working directory

* The notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\purpk\\OneDrive\\Documents\\Coding\\VineFind\\VineFind\\VineFind_v1\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\purpk\\OneDrive\\Documents\\Coding\\VineFind\\VineFind\\VineFind_v1'
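
Re-running the `os.chdir` cell moves up one more level each time, which can silently break every relative path used later. A minimal sketch of a guard, assuming the notebooks folder is named `jupyter_notebooks` as in the path printed above (the helper name `move_to_project_root` is hypothetical):

```python
import os
from pathlib import Path


def move_to_project_root(notebooks_dirname: str = "jupyter_notebooks") -> Path:
    """Move to the parent folder only when the current folder is the notebooks folder.

    Calling this more than once is safe: after the first call the current
    folder no longer matches notebooks_dirname, so nothing else happens.
    """
    cwd = Path.cwd()
    if cwd.name == notebooks_dirname:
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling `move_to_project_root()` in place of the bare `os.chdir(...)` cell returns the resulting working directory, so the confirmation step comes for free.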

---

# Import libraries

In [4]:
import missingno as msno
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport
import webbrowser



# Section 1: Load data

Download the wine review datasets from Kaggle into the raw data folder.

Install the Kaggle package to fetch the data

In [5]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


Point the Kaggle configuration directory at the current working directory so the **kaggle.json** token file can be found there

In [6]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

Get the dataset path from the Kaggle URL
* When viewing the dataset on Kaggle, copy the part of the URL that comes after https://www.kaggle.com/ (here, zynicide/wine-reviews).
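
The dataset path can also be extracted programmatically instead of being copied by hand. The sketch below uses only the standard library; the helper name `kaggle_dataset_path` is hypothetical, and it assumes that some Kaggle URLs carry a leading `datasets` segment that should be dropped.

```python
from urllib.parse import urlparse


def kaggle_dataset_path(url: str) -> str:
    """Return the 'owner/dataset' part of a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Assumption: some Kaggle URLs start with a 'datasets' segment; drop it if present.
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"No owner/dataset pair found in: {url}")
    return "/".join(parts[:2])


print(kaggle_dataset_path("https://www.kaggle.com/zynicide/wine-reviews"))
```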

Define the Kaggle dataset and the destination folder, then download the dataset.

In [7]:
KaggleDatasetPath = "zynicide/wine-reviews"
DestinationFolder = "VineFind_v1/inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "c:\Users\purpk\OneDrive\Documents\Coding\VineFind\.venv\Scripts\kaggle.exe\__main__.py", line 4, in <module>
  File "c:\Users\purpk\OneDrive\Documents\Coding\VineFind\.venv\Lib\site-packages\kaggle\__init__.py", line 23, in <module>
    api.authenticate()
  File "c:\Users\purpk\OneDrive\Documents\Coding\VineFind\.venv\Lib\site-packages\kaggle\api\kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in c:\Users\purpk\OneDrive\Documents\Coding\VineFind\VineFind\VineFind_v1. Or use the environment method.
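
The traceback above shows the download failing because `kaggle.json` is missing from the configured directory. A small pre-flight check, sketched below, reports this before the CLI is invoked. The helper `find_kaggle_token` is hypothetical, and falling back to the current working directory is an assumption that mirrors how this notebook sets `KAGGLE_CONFIG_DIR`.

```python
import os
from pathlib import Path


def find_kaggle_token(config_dir=None) -> Path:
    """Return the path to kaggle.json, or raise a readable error if it is absent."""
    # Assumption: fall back to KAGGLE_CONFIG_DIR, then to the current working directory.
    config_dir = Path(config_dir or os.environ.get("KAGGLE_CONFIG_DIR", os.getcwd()))
    token = config_dir / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(
            f"kaggle.json not found in {config_dir}; copy the token there before downloading."
        )
    return token
```

Running `find_kaggle_token()` in the cell before the download turns a long CLI traceback into a single actionable message.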


Unzip the downloaded file and delete the zip file

In [8]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/wine-reviews.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/wine-reviews.zip')

FileNotFoundError: [Errno 2] No such file or directory: 'VineFind_v1/inputs/datasets/raw/wine-reviews.zip'
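
The `FileNotFoundError` above follows from the failed download: the archive was never created, so there is nothing to unzip. A minimal sketch of a guarded version of this step, using a hypothetical helper `extract_and_remove`:

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, destination) -> bool:
    """Extract zip_path into destination, then delete the archive.

    Returns False, and changes nothing, when the archive is missing.
    """
    zip_path = Path(zip_path)
    if not zip_path.is_file():
        print(f"Skipping: {zip_path} not found (did the download succeed?)")
        return False
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
    zip_path.unlink()  # remove the archive once its contents are extracted
    return True
```

In this notebook the call would be `extract_and_remove(DestinationFolder + '/wine-reviews.zip', DestinationFolder)`.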

# Section 2: Merge datasets 

### Wine Mag Data first 150K

In [None]:
import pandas as pd
df = pd.read_csv("VineFind/VineFind_v1/inputs/datasets/raw/winemag-data_first150k.csv")
df.head(1)

|   | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | US | This tremendous 100% varietal wine hails from ... | Martha's Vineyard | 96 | 235.0 | California | Napa Valley | Napa | Cabernet Sauvignon | Heitz |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150930 entries, 0 to 150929
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Unnamed: 0   150930 non-null  int64  
 1   country      150925 non-null  object 
 2   description  150930 non-null  object 
 3   designation  105195 non-null  object 
 4   points       150930 non-null  int64  
 5   price        137235 non-null  float64
 6   province     150925 non-null  object 
 7   region_1     125870 non-null  object 
 8   region_2     60953 non-null   object 
 9   variety      150930 non-null  object 
 10  winery       150930 non-null  object 
dtypes: float64(1), int64(2), object(8)
memory usage: 12.7+ MB


Summary of null values for **winemag-data_first150k.csv**

In [None]:
df.isnull().sum()

Unnamed: 0         0
country            5
description        0
designation    45735
points             0
price          13695
province           5
region_1       25060
region_2       89977
variety            0
winery             0
dtype: int64

### Wine Mag Data 130K v2

In [None]:
df_130 = pd.read_csv("VineFind/VineFind_v1/inputs/datasets/raw/winemag-data-130k-v2.csv")
df_130.head(1)

|   | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |


In [None]:
df_130.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129971 entries, 0 to 129970
Data columns (total 14 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Unnamed: 0             129971 non-null  int64  
 1   country                129908 non-null  object 
 2   description            129971 non-null  object 
 3   designation            92506 non-null   object 
 4   points                 129971 non-null  int64  
 5   price                  120975 non-null  float64
 6   province               129908 non-null  object 
 7   region_1               108724 non-null  object 
 8   region_2               50511 non-null   object 
 9   taster_name            103727 non-null  object 
 10  taster_twitter_handle  98758 non-null   object 
 11  title                  129971 non-null  object 
 12  variety                129970 non-null  object 
 13  winery                 129971 non-null  object 
dtypes: float64(1), int64(2), object(11)


Summary of null values for **winemag-data-130k-v2.csv**

In [None]:
df_130.isnull().sum()

Unnamed: 0                   0
country                     63
description                  0
designation              37465
points                       0
price                     8996
province                    63
region_1                 21247
region_2                 79460
taster_name              26244
taster_twitter_handle    31213
title                        0
variety                      1
winery                       0
dtype: int64
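
Raw null counts are hard to compare across two files with different row counts, so percentages can be easier to read. The sketch below is one way to produce both; `null_summary` is a hypothetical helper and the toy frame is illustrative only, not taken from the wine datasets.

```python
import pandas as pd


def null_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and percentages per column, most incomplete first."""
    summary = pd.DataFrame({
        "null_count": frame.isnull().sum(),
        "null_pct": (frame.isnull().mean() * 100).round(2),
    })
    return summary.sort_values("null_count", ascending=False)


# Toy data for illustration only.
toy = pd.DataFrame({"price": [10.0, None, 30.0, None], "points": [90, 85, 88, 92]})
print(null_summary(toy))
```

Applied here, `null_summary(df)` and `null_summary(df_130)` would put the two datasets on the same scale.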

### Concatenate datasets and identify duplicates


In [None]:
concat_df = pd.concat([df, df_130], axis=0)
concat_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 280901 entries, 0 to 129970
Data columns (total 14 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Unnamed: 0             280901 non-null  int64  
 1   country                280833 non-null  object 
 2   description            280901 non-null  object 
 3   designation            197701 non-null  object 
 4   points                 280901 non-null  int64  
 5   price                  258210 non-null  float64
 6   province               280833 non-null  object 
 7   region_1               234594 non-null  object 
 8   region_2               111464 non-null  object 
 9   variety                280900 non-null  object 
 10  winery                 280901 non-null  object 
 11  taster_name            103727 non-null  object 
 12  taster_twitter_handle  98758 non-null   object 
 13  title                  129971 non-null  object 
dtypes: float64(1), int64(2), object(11)
memor
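
The heading mentions identifying duplicates, but the cell above only concatenates. One possible approach is sketched below on toy frames that stand in for `df` and `df_130`. The choice of `description`, `winery` and `points` as the key columns is an assumption, not something the notebook specifies, and `ignore_index=True` is used so the combined frame gets a fresh index rather than the repeated index seen in the output above.

```python
import pandas as pd

# Toy frames standing in for df and df_130 (values are illustrative only).
older = pd.DataFrame({
    "description": ["ripe plum", "crisp apple"],
    "winery": ["A", "B"],
    "points": [90, 85],
})
newer = pd.DataFrame({
    "description": ["crisp apple", "dark cherry"],
    "winery": ["B", "C"],
    "points": [85, 91],
    "title": ["B 2015", "C 2016"],
})

combined = pd.concat([older, newer], axis=0, ignore_index=True)

# Assumed key: rows that match on these columns are treated as the same review.
key_columns = ["description", "winery", "points"]

# keep="last" flags the earlier copy, so the row from the newer file
# (which carries the extra columns) is the one that survives.
duplicate_mask = combined.duplicated(subset=key_columns, keep="last")
print(f"Duplicate rows: {duplicate_mask.sum()}")

deduplicated = combined[~duplicate_mask].reset_index(drop=True)
print(deduplicated)
```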

---

# Push files to Repo

# Output

Create a new CSV file of the collected data

In [None]:
import os
# Create the outputs/datasets/collection folder if it does not already exist
os.makedirs('VineFind/VineFind_v1/outputs/datasets/collection', exist_ok=True)

# Save the concatenated data without the DataFrame index
concat_df.to_csv("VineFind/VineFind_v1/outputs/datasets/collection/wine_reviews_collected.csv", index=False)
