# **Data Cleaning Notebook**

## Objectives

*   Evaluate missing data
*   Clean data
*   Explore whether the `['description']` variable is truncated

## Inputs

*   VineFind_v1/outputs/datasets/collection/wine_reviews_collected.csv

## Outputs

*   Generate clean dataset: VineFind_v1/outputs/datasets/cleaned/wine_reviews_cleaned.csv

## Conclusions

 
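  * Data Cleaning Pipeline
    * Drop variables: `['Unnamed: 0']`
    * Drop duplicate rows based on `['description']`
    * Convert the text variables to the `string` dtype
    * Lowercase `['description']` and strip accents and selected punctuation for BERT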


---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [2]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\purpk\\OneDrive\\Documents\\Coding\\VineFind\\VineFind\\VineFind_v1\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\purpk\\OneDrive\\Documents\\Coding\\VineFind\\VineFind\\VineFind_v1'

# Import libraries

In [5]:
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Section 1: Import Data

Load the collected wine reviews dataset and preview the first row.

In [6]:
df = pd.read_csv("outputs/datasets/collection/wine_reviews_collected.csv", dtype={11: str, 12: str, 13: str})
df.head(1)

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery,taster_name,taster_twitter_handle,title
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,,,


# Section 2: Data Cleaning

### Remove the `['Unnamed: 0']` feature

In [7]:
df = df.drop(columns=["Unnamed: 0"])
df.head(1)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery,taster_name,taster_twitter_handle,title
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,,,


### Drop duplicates from the dataset 

In [10]:
df.drop_duplicates(subset=['description'], inplace=True)
df.reset_index(drop=True, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169430 entries, 0 to 169429
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   country                169370 non-null  object 
 1   description            169430 non-null  object 
 2   designation            119363 non-null  object 
 3   points                 169430 non-null  int64  
 4   price                  156609 non-null  float64
 5   province               169370 non-null  object 
 6   region_1               141517 non-null  object 
 7   region_2               67516 non-null   object 
 8   variety                169429 non-null  object 
 9   winery                 169430 non-null  object 
 10  taster_name            62354 non-null   object 
 11  taster_twitter_handle  59291 non-null   object 
 12  title                  71609 non-null   object 
dtypes: float64(1), int64(1), object(11)
memory usage: 16.8+ MB
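
The `df.info()` output above shows that several columns have gaps (for example `price` and `region_2`). To evaluate missing data as listed in the Objectives, a per-column summary can help. The following is a minimal sketch, not part of the original notebook: the helper name `missing_data_summary` and the toy DataFrame are assumptions made for illustration, and in the notebook the helper would be called on `df` instead.

```python
import numpy as np
import pandas as pd


def missing_data_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the missing-value count and percentage per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually contain missing values
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Toy frame standing in for the wine reviews data (values are illustrative only)
toy = pd.DataFrame({
    "country": ["US", "France", None, "Italy"],
    "price": [235.0, np.nan, np.nan, 20.0],
    "points": [96, 90, 88, 91],
})

summary = missing_data_summary(toy)
print(summary)
```

Columns with no missing values are left out of the summary, which keeps the table short on a wide dataset.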


### Data type conversion

In [25]:
columns_to_change = ["country", "description", "designation", "province", "region_1", "region_2", "variety", "winery", "taster_name", "taster_twitter_handle", "title"]

for col in columns_to_change:
    if col in df.columns:
        df[col] = df[col].astype('string')
    else:
        print(f"Column {col} not found in DataFrame.")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280901 entries, 0 to 280900
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   country                280833 non-null  string 
 1   description            280901 non-null  string 
 2   designation            280901 non-null  string 
 3   points                 280901 non-null  int64  
 4   price                  258210 non-null  float64
 5   province               280901 non-null  string 
 6   region_1               280901 non-null  string 
 7   region_2               280901 non-null  string 
 8   variety                280901 non-null  string 
 9   winery                 280901 non-null  string 
 10  taster_name            280901 non-null  string 
 11  taster_twitter_handle  280901 non-null  string 
 12  title                  280901 non-null  string 
dtypes: float64(1), int64(1), string(11)
memory usage: 27.9 MB


In [26]:
description = df['description']
description[4]

'this is the top wine from la bgude named after the highest point in the vineyard at feet it has structure density and considerable acidity that is still calming down with months in wood the wine has developing an extra richness and concentration produced by the tari family formerly of chteau giscours in margaux it is a wine made for aging drink from'
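
The Objectives mention exploring whether `['description']` is truncated. One possible heuristic, sketched below, flags descriptions that end in an ellipsis or that do not end with sentence-closing punctuation. This is an assumption about how truncation might show up, not a method taken from the notebook. The helper name `flag_possible_truncation` and the sample sentences are hypothetical. The check only makes sense on the raw text, before any step that strips punctuation.

```python
import pandas as pd


def flag_possible_truncation(descriptions: pd.Series) -> pd.Series:
    """Return a boolean Series that is True where a description may be cut off."""
    text = descriptions.fillna("").astype(str).str.strip()
    # Ends with three dots or a single ellipsis character
    ends_with_ellipsis = text.str.contains(r"(?:\.\.\.|…)$", regex=True)
    # Ends with ., ! or ?, optionally followed by closing quotes or brackets
    ends_properly = text.str.contains(r"[.!?][\"')\]]*$", regex=True)
    return ends_with_ellipsis | ~ends_properly


# Illustrative sentences only, not rows from the dataset
samples = pd.Series([
    "A ripe, balanced wine. Drink now.",
    "Shows cherry and oak with a long",
    "Dense and tannic...",
    "Bright acidity!",
])

flags = flag_possible_truncation(samples)
print(flags.tolist())
print(f"{flags.mean():.0%} of the samples are flagged")
```

Empty or missing descriptions are also flagged, because they do not end with closing punctuation. A high share of flagged rows would suggest truncation at the source, while a low share points to ordinary stylistic variation.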

---

# Cleaning

### BERT

This step cleans the text for BERT. It adds uniformity to the text while trying to preserve the linguistic intent that the model needs for training.


In [23]:
from unidecode import unidecode
import re

def remove_accents(text):
    """
    Removes accents from text, preserving a short list of exception words.
    """
    # Lowercase set, so matching still works on text that has already been lowercased
    exceptions = {
        "chateau", "château", "café", "thé", "vino", "città", "vinho", "país"
    }
    cleaned_words = []
    for word in text.split():
        if word.lower() in exceptions:
            cleaned_words.append(word)
        else:
            cleaned_words.append(unidecode(word))
    return " ".join(cleaned_words)

def remove_punctuation_spaces(text):
    """
    Collapses runs of whitespace into a single space, then removes
    commas, semicolons, full stops, and exclamation marks.
    """
    text = re.sub(r'\s+', ' ', text)
    return re.sub(r'[,;.!]', '', text)

def clean_text(text):
    """
    Cleans the text by converting to lowercase, removing accents, and removing punctuation.
    """
    text = text.lower()
    text = remove_accents(text)
    text = remove_punctuation_spaces(text)
    return text

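# Apply cleaning to the description column (it has no missing values)
df['description'] = df['description'].astype(str).apply(clean_text)
de_accent = ['designation', 'province', 'region_1', 'region_2', 'variety', 'winery', 'taster_name', 'taster_twitter_handle', 'title']
for col in de_accent:
    # Leave missing values untouched: astype(str) can turn them into literal strings such as "nan" or "<NA>"
    df[col] = df[col].apply(lambda value: remove_accents(value) if pd.notna(value) else value)

df.tail()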

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery,taster_name,taster_twitter_handle,title
280896,Germany,notes of honeysuckle and cantaloupe sweeten th...,Brauneberger Juffer-Sonnenuhr Spatlese,90,28.0,Mosel,,,Riesling,Dr. H. Thanisch (Erben Muller-Burggraef),Anna Lee C. Iijima,,Dr. H. Thanisch (Erben Muller-Burggraef) 2013 ...
280897,US,citation is given as much as a decade of bottl...,,90,75.0,Oregon,Oregon,Oregon Other,Pinot Noir,Citation,Paul Gregutt,@paulgwine,Citation 2004 Pinot Noir (Oregon)
280898,France,welldrained gravel soil gives this wine its cr...,Kritt,90,30.0,Alsace,Alsace,,Gewurztraminer,Domaine Gresser,Roger Voss,@vossroger,Domaine Gresser 2013 Kritt Gewurztraminer (Als...
280899,France,a dry style of pinot gris this is crisp with s...,,90,32.0,Alsace,Alsace,,Pinot Gris,Domaine Marcel Deiss,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace)
280900,France,big rich and offdry this is powered by intense...,Lieu-dit Harth Cuvee Caroline,90,21.0,Alsace,Alsace,,Gewurztraminer,Domaine Schoffit,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvee Car...
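
To see the effect of this kind of cleaning on a single sentence, the sketch below runs the same three steps (lowercase, strip accents, remove `,;.!`) on one sample. It is an approximation rather than the notebook's function. It swaps `unidecode` for the standard library's `unicodedata` so that it runs without extra installs, and it has no exception list. The names `strip_accents_stdlib` and `clean_for_bert` and the sample sentence are hypothetical.

```python
import re
import unicodedata


def strip_accents_stdlib(text: str) -> str:
    """Approximate accent removal: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_for_bert(text: str) -> str:
    """Lowercase, strip accents, remove , ; . ! and collapse whitespace."""
    text = strip_accents_stdlib(text.lower())
    text = re.sub(r"[,;.!]", "", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "Formerly of Château Giscours; rich,  dense and age-worthy!"
cleaned = clean_for_bert(sample)
print(cleaned)  # formerly of chateau giscours rich dense and age-worthy
```

Hyphens are kept in this sketch because only the four listed punctuation marks are removed.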



# Push files to Repo

* Save the cleaned dataset to the output folder listed under Outputs.

In [32]:
import os

# Paths are relative to the working directory set above (the VineFind_v1 project root)
output_dir = 'outputs/datasets/cleaned'
os.makedirs(output_dir, exist_ok=True)  # no error if the folder already exists

df.to_csv(f'{output_dir}/wine_reviews_cleaned.csv', index=False)
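
After saving, it can be worth reading the file back to confirm that the row count and the missing values survive the round trip. The sketch below does this on a toy DataFrame in a temporary folder, so it does not touch the project outputs. The toy data and the variable names are assumptions for illustration. In the notebook the same check would compare `df` with the reloaded `wine_reviews_cleaned.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned dataset (values are illustrative only)
toy = pd.DataFrame({"country": ["US", None], "points": [96, 90]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "outputs" / "datasets" / "cleaned"
    out_dir.mkdir(parents=True, exist_ok=True)  # creates parent folders as needed
    out_file = out_dir / "wine_reviews_cleaned.csv"

    toy.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

same_shape = reloaded.shape == toy.shape
missing_preserved = int(reloaded["country"].isna().sum()) == int(toy["country"].isna().sum())
print(same_shape, missing_preserved)
```

If missing values had been written as literal strings such as "nan", they would still be read back as missing by default, so a stricter check would inspect the raw file as well.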
