## 1. Cleaning

This notebook focuses on the data cleaning and pre-processing phase of the coffee reviews analysis project.

Main objectives:
* Clean and standardize prices (convert to USD/kg)
* Handle missing values and outliers
* Clean and standardize country of origin data
* Clean textual data for various analysis tasks (embeddings, topic modeling, sentiment)
* Save processed datasets for downstream tasks


In [1]:
%load_ext autoreload
%autoreload 2

import sys
from pathlib import Path

# Add project root to Python path
project_root = Path().absolute().parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

import polars as pl
import plotly.express as px
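# Project helpers used below (profile_dataset, standardize_prices, clean_dataset, ...)
from src.utils.cleaning import *

# Define paths using project structure
data_dir = project_root / "data"
output_dir = data_dir / "processed"
# parents=True so the call also works when data/ does not exist yet
output_dir.mkdir(parents=True, exist_ok=True)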

output_paths = {
    "embeddings": output_dir / "embeddings.parquet",
    "topic_modeling": output_dir / "topic_modeling.parquet",
    "sentiment": output_dir / "sentiment.parquet"
}

## 1.1 Data Ingestion and Profiling

In this section, we load the coffee reviews dataset and perform initial profiling to understand its structure. This includes examining columns, data types, and checking for missing values or anomalies to inform our cleaning strategy.


In [2]:
# Load and profile data
data = pl.read_csv(
    data_dir / "raw" / "coffee_clean.csv",
    null_values="NA",
    infer_schema_length=10000
)

# Comprehensive data profiling
profile_dataset(data, "Coffee Reviews")

# Analyze numerical features
analyze_numerical_columns(data)

# Analyze target variable
check_target_variable(data, "rating")

# Additional checks for text columns
text_columns = ["desc_1", "desc_2", "desc_3"]
print("\nText Columns Sample:")
print("-" * 50)
display(data.select(text_columns).head())



Coffee Reviews Overview:
--------------------------------------------------
Number of records: 2,775
Number of features: 20

Column Information:
--------------------------------------------------

Detailed Column Analysis:
--------------------------------------------------------------------------------
Column               Type        Missing  Missing %   Unique Values
--------------------------------------------------------------------------------
slug                 String            0       0.0%            2775
all_text             String            0       0.0%            2774
rating               Int64             0       0.0%              22
roaster              String            0       0.0%             546
name                 String            0       0.0%            2510
location             String            1       0.0%             332
origin               String            0       0.0%            1158
roast                String           73       2.6%               7
…

| statistic (str) | rating (f64) | aroma (f64) | acid (f64) | body (f64) | flavor (f64) | aftertaste (f64) | with_milk (f64) |
|---|---|---|---|---|---|---|---|
| count | 2775.0 | 2749.0 | 2395.0 | 2773.0 | 2773.0 | 2773.0 | 402.0 |
| null_count | 0.0 | 26.0 | 380.0 | 2.0 | 2.0 | 2.0 | 2373.0 |
| mean | 92.966847 | 8.807203 | 8.494781 | 8.594663 | 8.939055 | 8.087631 | 8.835821 |
| std | 1.989242 | 0.487029 | 0.626178 | 0.52855 | 0.43424 | 0.548416 | 0.512108 |
| min | 63.0 | 2.0 | 1.0 | 5.0 | 2.0 | 2.0 | 5.0 |
| 25% | 92.0 | 9.0 | 8.0 | 8.0 | 9.0 | 8.0 | 9.0 |
| 50% | 93.0 | 9.0 | 9.0 | 9.0 | 9.0 | 8.0 | 9.0 |
| 75% | 94.0 | 9.0 | 9.0 | 9.0 | 9.0 | 8.0 | 9.0 |
| max | 98.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |



Target Variable (rating) Analysis:
--------------------------------------------------
Basic Statistics:


| statistic (str) | value (f64) |
|---|---|
| count | 2775.0 |
| null_count | 0.0 |
| mean | 92.966847 |
| std | 1.989242 |
| min | 63.0 |
| 25% | 92.0 |
| 50% | 93.0 |
| 75% | 94.0 |
| max | 98.0 |


No missing values found in the target column.



Text Columns Sample:
--------------------------------------------------


desc_1,desc_2,desc_3
str,str,str
"""High-toned, crisply sweet. Dat…","""Produced by Luis Alberto Monto…","""A vivacious, sweetly tart Colo…"
"""Richly sweet, fruit-toned. Str…","""Produced by Danilo Salazar Ari…","""A fruit-forward Costa Rica cup…"
"""Rich-toned, cleanly fruit-forw…","""This exceptional coffee was se…","""A harmonious, refined, exquisi…"
"""Crisply sweet, fruit-toned. Ra…","""This coffee tied for the third…","""A sweetly fruit-forward natura…"
"""Crisp, elegantly sweet, rich-t…","""Produced from trees of the rar…","""Tropical fruit- and spice-tone…"


**Observations:**

- The dataset contains 2775 rows and 20 columns, with various fields related to coffee reviews, including ratings, roaster details, coffee descriptions, and sensory attributes like aroma, acid, body, flavor, and aftertaste.
- There are several missing values across different columns:
  - `location` has 1 missing value.
  - `roast` has 73 missing values (2.6%).
  - `est_price` has 6 missing values.
  - The sensory attributes are incomplete: `aroma` (26 missing), `acid` (380 missing), and `with_milk` (2,373 missing). `with_milk` is recorded for only 402 of the 2,775 reviews.
- The `desc_1`, `desc_2`, and `desc_3` columns provide text descriptions of the coffee and have very few missing values (1–2 missing values per column).
- The `rating` column ranges from 63 to 98, with a mean of 92.97 and a standard deviation of 1.99, indicating that most reviews are for high-quality coffees.

**Impact:**

- The missing values in columns like `roast`, `est_price`, and sensory attributes (`aroma`, `acid`, etc.) could affect the analysis, so they need to be addressed carefully in the cleaning process.
- The `with_milk` column, with 2373 missing values out of 2775, may not provide meaningful insights due to the limited number of recorded instances, and it could potentially be dropped.
- The `desc_1`, `desc_2`, and `desc_3` text fields seem to be important and have very few missing values, making them useful for further analysis (e.g., topic modeling, sentiment analysis).
- The ratings indicate that most reviews are of relatively high-rated coffees, which could introduce a positive bias in the analysis.



## 1.2 Data Standardization
Before dropping columns, we'll standardize three key features:
1. Price: Convert all prices to USD per kilogram
2. Country of Origin: Extract and standardize country information
3. Roast Degree: Derive an objective roast level from the Agtron readings

This will help us make informed decisions about which features to keep or drop.
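
To make the price conversion concrete, here is a minimal sketch of how a single `est_price` string could be turned into USD/kg. It is not the implementation of `standardize_prices`; the function name `usd_per_kg`, the regular expression, and the restriction to the `"$<amount>/<n> ounces"` format are all assumptions for illustration. The conversion factor of 35.274 ounces per kilogram is inferred from the `price_per_kg` values shown later in this notebook (for example, `$18.50/12 ounces` maps to 54.38075).

```python
import re
from typing import Optional

# Assumption: 1 kg = 35.274 oz, which reproduces the price_per_kg values in the notebook output.
OUNCES_PER_KG = 35.274

# Matches strings such as "$18.50/12 ounces"; other currencies and units are out of scope here.
_PRICE_RE = re.compile(r"^\$(?P<amount>\d+(?:\.\d+)?)/(?P<qty>\d+(?:\.\d+)?)\s*ounces?$")


def usd_per_kg(est_price: str) -> Optional[float]:
    """Return the price in USD/kg, or None if the string is not in the assumed format."""
    match = _PRICE_RE.match(est_price.strip())
    if match is None:
        return None
    amount = float(match.group("amount"))
    ounces = float(match.group("qty"))
    if ounces == 0:
        return None  # avoid division by zero on malformed entries
    return amount / ounces * OUNCES_PER_KG


print(usd_per_kg("$18.50/12 ounces"))  # about 54.38 USD/kg
```

Returning `None` for anything outside the assumed format keeps unparsed prices visible as missing values instead of silently producing wrong numbers.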

In [3]:
# 1.2.1 Price Standardization
data = standardize_prices(data)
analyze_price_distribution(data)

# 1.2.2 Country of Origin Standardization
data = extract_and_correct_country(data)
country_stats = analyze_country_distribution(data)

# Document key findings
top_countries = country_stats.head(3).select('country_of_origin').to_series().to_list()

print("\nKey Observations:")
print("-" * 50)
print("• Successfully standardized prices to USD/kg")
print(f"• Extracted {data['country_of_origin'].n_unique()} unique countries")
print(f"• Top 3 countries by review count: {', '.join(top_countries)}")


Price Statistics (USD/kg):
--------------------------------------------------


| Mean (f64) | Median (f64) | Std (f64) | Min (f64) | Max (f64) |
|---|---|---|---|---|
| 841.727919 | 58.79 | 17354.023455 | 2.116435 | 600000.0 |



Country Distribution Analysis:
--------------------------------------------------

Top 10 Countries by Number of Reviews:


| country_of_origin (str) | count (u32) | avg_rating (f64) | avg_price_per_kg (f64) |
|---|---|---|---|
| Ethiopia | 662 | 93.321752 | 274.813309 |
| Colombia | 305 | 93.003279 | 1145.173668 |
| Kenya | 233 | 93.785408 | 121.518769 |
| Guatemala | 186 | 92.182796 | 480.511519 |
| United States | 139 | 93.561151 | 205.325054 |
| Panama | 119 | 94.487395 | 531.12537 |
| Costa Rica | 117 | 92.863248 | 77.393483 |
| Indonesia | 86 | 92.872093 | 14182.561019 |
| ND | 81 | 90.320988 | 137.508485 |
| Peru | 59 | 92.440678 | 71.444239 |



Key Observations:
--------------------------------------------------
• Successfully standardized prices to USD/kg
• Extracted 183 unique countries
• Top 3 countries by review count: Ethiopia, Colombia, Kenya


In [4]:
# Standardize roast information
data = standardize_roast_degree(data)
roast_comparison = analyze_roast_standardization(data)

print("\nKey Insights:")
print("-" * 50)
print("• Added objective roast levels based on Agtron ground readings")
print("• Can compare subjective roast labels with objective measurements")
print("• Maintained both original and standardized roast information")
print("• Added separate columns for ground and whole bean readings")


Roast Level Analysis:
--------------------------------------------------

Distribution of Standardized Roast Levels:


| roast_by_agtron (str) | count (u32) | avg_ground (f64) | avg_whole (f64) |
|---|---|---|---|
| null | 35 | null | 58.0 |
| Light | 28 | 290.0 | 87.964286 |
| Medium-Light | 1818 | 59.015952 | 77.758526 |
| Medium | 750 | 51.124 | 70.54 |
| Medium-Dark | 89 | 41.134831 | 55.595506 |
| Dark | 55 | 2.363636 | 3.872727 |



Comparison with Original Roast Labels:


| roast (str) | roast_by_agtron (str) | count (u32) |
|---|---|---|
| Medium-Light | Medium-Light | 1409 |
| Medium-Light | Medium | 454 |
| Light | Medium-Light | 405 |
| Medium | Medium | 292 |
| Medium-Dark | Medium-Dark | 53 |
| … | … | … |
| Dark | Dark | 4 |
| Medium | Medium-Light | 4 |
| Dark | Medium-Dark | 4 |
| Medium | null | 1 |



Key Insights:
--------------------------------------------------
• Added objective roast levels based on Agtron ground readings
• Can compare subjective roast labels with objective measurements
• Maintained both original and standardized roast information
• Added separate columns for ground and whole bean readings


**Agtron Value Analysis and Treatment**

1. **Understanding Agtron Measurements**:
   - Agtron values appear in "xx/yy" format (e.g., "58/76")
   - First number: Ground coffee reading
   - Second number: Whole bean reading
   - Higher numbers indicate lighter roasts, lower numbers indicate darker roasts

2. **Professional Standards**:
   - Ground coffee readings are treated as the more reliable indicator of roast level
   - Scale applied in this notebook to ground coffee readings:
     * Higher than 65: Light roast
     * 55-65: Medium-Light roast
     * 45-55: Medium roast
     * 35-45: Medium-Dark roast
     * Lower than 35: Dark roast

3. **Treatment Approach**:
   - Split original Agtron values into separate ground and whole bean readings
   - Created standardized roast categories based on ground readings
   - Maintained both objective (Agtron-based) and subjective (original roast label) information
   - This allows for comparison between measured roast levels and roaster-assigned categories

4. **Value for Analysis**:
   - Provides objective measure of roast degree
   - Enables standardization across different roasters
   - Helps validate subjective roast level assignments
   - Can be used to analyze relationship between roast degree and coffee ratings
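
The treatment described above can be sketched in a few lines of plain Python. This is an illustration only, not the code behind `standardize_roast_degree`: the helper names `split_agtron` and `roast_from_ground` are hypothetical, and because the listed ranges share their endpoints (55 and 45 appear in two ranges each), the sketch assumes every lower bound is inclusive.

```python
from typing import Optional, Tuple


def split_agtron(value: str) -> Tuple[Optional[float], Optional[float]]:
    """Split "ground/whole" (e.g. "58/76") into two floats; unparsable parts become None."""
    def to_float(part: str) -> Optional[float]:
        try:
            return float(part)
        except ValueError:
            return None

    ground, _, whole = value.partition("/")
    return to_float(ground.strip()), to_float(whole.strip())


def roast_from_ground(ground: Optional[float]) -> Optional[str]:
    """Map a ground reading to a roast label using the thresholds listed above.

    Assumption: lower bounds are inclusive (55 -> Medium-Light, 45 -> Medium, 35 -> Medium-Dark).
    """
    if ground is None:
        return None
    if ground > 65:
        return "Light"
    if ground >= 55:
        return "Medium-Light"
    if ground >= 45:
        return "Medium"
    if ground >= 35:
        return "Medium-Dark"
    return "Dark"


ground, whole = split_agtron("58/76")
print(ground, whole, roast_from_ground(ground))
```

Under these assumptions a ground reading of 51 maps to `Medium` and 56 to `Medium-Light`, which agrees with the `agtron_ground` and `roast_by_agtron` values in the sample rows printed at the end of this notebook.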

## 1.3 Dropping Irrelevant Columns

After standardizing the key variables (price, country of origin, roast level), we drop columns that are irrelevant to the analysis, redundant with the standardized features, or mostly missing. This streamlines the dataset and keeps the focus on the features used downstream.

- **Summary**:
  - `slug`: Contains the URLs of the reviews, which hold no predictive value.
  - `all_text`: Replaced by the more granular `desc_1`, `desc_2`, and `desc_3` columns.
  - `name`: Highly granular (2,510 unique values in 2,775 rows), making it unhelpful for analysis.
  - `location`: The roaster's location is recorded inconsistently and is not directly relevant to coffee quality. Any roaster-level effect on quality can be captured with the `roaster` variable instead.
  - `origin`: Replaced by `country_of_origin` in Section 1.2.
  - `review_date`: Dropped because there is insufficient data for meaningful temporal or seasonal analysis.
  - `agtron`: Split into `agtron_ground` and `agtron_whole` in Section 1.2 and used to derive the objective `roast_by_agtron` measure.
  - `with_milk`: Missing in 2,373 of 2,775 rows, which makes it of little use for analysis.

Once these columns are removed, the dataset will be more concise and focused on the features that matter most for subsequent analysis.


In [5]:
# Store original data for comparison
data_original = data.clone()

# Drop irrelevant columns
data = drop_irrelevant_columns(data)

# Summarize changes
summarize_column_changes(data_original, data)


Column Changes:
--------------------------------------------------
Original columns: 25
Remaining columns: 17

Remaining columns:
• rating
• roaster
• roast
• est_price
• aroma
• acid
• body
• flavor
• aftertaste
• desc_1
• desc_2
• desc_3
• price_per_kg
• country_of_origin
• agtron_ground
• agtron_whole
• roast_by_agtron


## 1.3 Handling Missing Values and Outliers

After dropping irrelevant columns, we need to handle the remaining missing values and outliers to ensure a clean dataset for further analysis.

#### Handling Missing Values

- **Missing Values Strategy**:
  - Missing values are concentrated in `acid` (380 rows, 13.7%) and `roast` (73 rows, 2.6%); the other sensory attributes (`aroma`, `body`, `flavor`, `aftertaste`) and the description columns are each missing in fewer than 1% of rows.
  - Since the sensory attributes are crucial for our analysis, rows with missing values in these columns will be dropped to avoid distorting the model outcomes.

- **Columns with Missing Values**:
  - `roast`: Rows with missing values will be dropped.
  - `aroma`, `acid`, `body`, `flavor`, and `aftertaste`: Rows with missing values in these sensory attributes will be dropped.
  - `desc_1`, `desc_2`, and `desc_3`: Very few missing values (1-2 per column), so we'll drop these rows as well.

#### Handling Outliers

- **Outliers Strategy**:
  - Outliers can skew the results and lead to inaccurate conclusions.
  - We'll detect outliers in the `rating` column using a boxplot and remove entries with abnormally low ratings (below 80) that appear to represent poor-quality or instant coffees, which are not the focus of our analysis.

- **Outliers Detected**:
  - Ratings below 80 are considered outliers as they are primarily instant coffees or ultra-dark roasts, which differ significantly from the premium coffees in the dataset.
  - These will be removed to maintain consistency in the quality of reviews.

By addressing both missing values and outliers, we ensure that the dataset is clean and reliable for further analysis.
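
As a rough illustration of the two steps (rating filter first, then dropping incomplete rows), the sketch below works on plain dictionaries so it runs without any dependencies. It is not the implementation of `clean_dataset`; the function name `clean_rows`, the `required` parameter, and the sample rows are assumptions. In Polars the same steps would correspond to `DataFrame.filter` followed by `DataFrame.drop_nulls`.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def clean_rows(
    rows: List[Row], required: List[str], min_rating: int = 80
) -> Tuple[List[Row], Dict[str, int]]:
    """Apply the rating filter first, then drop rows missing any required field."""
    rated = [r for r in rows if r.get("rating") is not None and r["rating"] >= min_rating]
    complete = [r for r in rated if all(r.get(col) is not None for col in required)]
    stats = {
        "initial_rows": len(rows),
        "removed_by_rating": len(rows) - len(rated),
        "removed_by_missing": len(rated) - len(complete),
        "final_rows": len(complete),
    }
    return complete, stats


# Hypothetical sample rows, not taken from the dataset
sample = [
    {"rating": 93, "acid": 9, "roast": "Medium-Light"},
    {"rating": 63, "acid": 5, "roast": "Dark"},       # below the rating threshold
    {"rating": 92, "acid": None, "roast": "Medium"},  # missing sensory score
]
cleaned, stats = clean_rows(sample, required=["acid", "roast"])
print(stats)
```

Counting removals per step, as the `stats` dictionary does, makes it easy to see how much data each rule costs before committing to it.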


In [6]:
# Analyze current state of missing values
analyze_missing_values(data)

# Analyze outliers in ratings
analyze_outliers(data, "rating")

# Clean dataset and get statistics
data_cleaned, cleaning_stats = clean_dataset(data)

# Display cleaning results
print("\nData Cleaning Results:")
print("-" * 50)
print(f"• Initial rows: {cleaning_stats['initial_rows']:,}")
print(f"• Rows after rating filter (≥80): {cleaning_stats['rows_after_rating_filter']:,}")
print(f"• Final rows: {cleaning_stats['final_rows']:,}")
print(f"\nRows removed:")
print(f"• By rating filter: {cleaning_stats['removed_by_rating']:,}")
print(f"• By missing values: {cleaning_stats['removed_by_missing']:,}")
print(f"• Total removed: {cleaning_stats['total_removed']:,} ({cleaning_stats['total_removed_pct']:.1f}%)")

# Verify final dataset quality
print("\nFinal Data Quality Check:")
print("-" * 50)
analyze_missing_values(data_cleaned)


Missing Values Analysis:
--------------------------------------------------


| Column (str) | Missing (i64) | Missing % (str) |
|---|---|---|
| acid | 380 | 13.7% |
| roast | 73 | 2.6% |
| agtron_ground | 35 | 1.3% |
| roast_by_agtron | 35 | 1.3% |
| agtron_whole | 34 | 1.2% |
| … | … | … |
| body | 2 | 0.1% |
| flavor | 2 | 0.1% |
| aftertaste | 2 | 0.1% |
| desc_3 | 2 | 0.1% |



Outlier Analysis for rating:
--------------------------------------------------

Basic Statistics:


mean,median,std,min,max
f64,f64,f64,i64,i64
92.966847,93.0,1.989242,63,98



Data Cleaning Results:
--------------------------------------------------
• Initial rows: 2,775
• Rows after rating filter (≥80): 2,771
• Final rows: 2,322

Rows removed:
• By rating filter: 4
• By missing values: 449
• Total removed: 453 (16.3%)

Final Data Quality Check:
--------------------------------------------------

Missing Values Analysis:
--------------------------------------------------


Column,Missing,Missing %
str,i64,str


## 1.4 Text Preprocessing

Text data in this dataset includes descriptive fields about the coffee, which need to be cleaned and standardized for further analysis, such as embeddings, topic modeling, and sentiment analysis.

### Text Preprocessing Pipeline:

1. **Removing URLs**: 
   - URLs are not useful for text analysis and can be removed to avoid unnecessary noise in the data.

2. **Cleaning Text**: 
   - The text is converted to lowercase to ensure consistency. Any digits or content in square brackets (like footnotes) are removed.

3. **Removing Punctuation**: 
   - Punctuation contributes little to topic modeling and is removed for that task. For embeddings and sentiment analysis it can carry useful signal, so it is retained.

4. **Stopword Removal**: 
   - Common words like "the", "is", and "and" do not add much value in most text analyses and are removed to focus on meaningful words. This step is necessary for embeddings but not for topic modeling.

5. **Lemmatization**: 
   - Words are reduced to their base forms (e.g., "running" becomes "run") to avoid treating different forms of the same word as separate entities. This step improves the consistency of text data.

6. **Handling Negations**:
   - To preserve meaning in sentences, negations such as "isn't" are converted to "is not". This helps in ensuring that the sentiment or meaning in the text remains intact during analysis.

### Application of Preprocessing:

- The text preprocessing will be applied to the description columns: `desc_1`, `desc_2`, and `desc_3`. These columns contain detailed text descriptions of the coffee that are vital for analysis.
  
- **For embeddings and sentiment analysis**: 
   - Punctuation will be retained, and stopwords will be removed to focus on the most meaningful content.
  
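- **For topic modeling**: 
   - Punctuation will be removed. Stopwords are meant to be retained to capture thematic patterns, although the sample output below shows words such as "by" and "this" already removed from the topic-modeling text, so the stopword setting in `process_and_analyze_text` is worth verifying.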
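
The pipeline steps can be sketched with the standard library alone. This is a simplified illustration, not the code in `process_and_analyze_text`: the function name `preprocess`, its two flags, the tiny stopword set, and the short negation table are all assumptions, and lemmatization is left out because it needs NLTK's WordNet data.

```python
import re
import string

# Tiny illustrative stopword set; the notebook downloads NLTK's English list instead.
STOPWORDS = {"a", "an", "the", "is", "and", "by", "of"}

# A few contractions to expand; a real table would be longer.
NEGATIONS = {"isn't": "is not", "doesn't": "does not", "don't": "do not"}


def preprocess(text: str, remove_punctuation: bool, remove_stopwords: bool) -> str:
    """Clean one description string according to the two task-specific flags."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # 1. drop URLs
    text = text.lower()                                   # 2. lowercase
    for short, full in NEGATIONS.items():                 # 6. expand negations before
        text = text.replace(short, full)                  #    apostrophes are stripped
    text = re.sub(r"\[.*?\]", " ", text)                  # 2. drop bracketed content
    text = re.sub(r"\d+", " ", text)                      # 2. drop digits
    if remove_punctuation:                                # 3. optional punctuation removal
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if remove_stopwords:                                  # 4. optional stopword removal
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


# Embeddings / sentiment style: keep punctuation, drop stopwords
print(preprocess("It isn't bitter.", remove_punctuation=False, remove_stopwords=True))
# Topic-modeling style: drop punctuation, keep stopwords
print(preprocess("It isn't bitter.", remove_punctuation=True, remove_stopwords=False))
```

Expanding negations before stripping punctuation matters here: once the apostrophe is removed, "isn't" becomes "isnt" and can no longer be matched.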


In [7]:
# Download required NLTK resources
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')

# Process text for different analysis types
desc_columns = ['desc_1', 'desc_2', 'desc_3']
df_embeddings, df_topic_modeling, df_sentiment = process_and_analyze_text(data_cleaned, desc_columns)

# Store processed datasets
data_embeddings = df_embeddings
data_topic = df_topic_modeling
data_sentiment = df_sentiment

[nltk_data] Downloading package punkt to /Users/seijas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/seijas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/seijas/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/seijas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



Text Preprocessing Summary:
--------------------------------------------------
Applying text preprocessing...
Text preprocessing complete.

Sample of processed text for each purpose:

Embeddings preprocessing (stopwords removed, punctuation retained):


processed_desc_1,processed_desc_2,processed_desc_3
str,str,str
"""high-toned , crisply sweet . d…","""produced luis alberto montoya …","""vivacious , sweetly tart colom…"
"""richly sweet , fruit-toned . s…","""produced danilo salazar aria f…","""fruit-forward costa rica cup t…"



Topic Modeling preprocessing (stopwords retained, punctuation removed):


processed_desc_1,processed_desc_2,processed_desc_3
str,str,str
"""hightoned crisply sweet date…","""produced luis alberto montoya …","""vivacious sweetly tart colomb…"
"""richly sweet fruittoned stra…","""produced danilo salazar aria f…","""fruitforward costa rica cup th…"



Text Statistics:

Embeddings:
• desc_1 average word count: 40.8
• desc_2 average word count: 56.3
• desc_3 average word count: 22.7

Topic Modeling:
• desc_1 average word count: 40.8
• desc_2 average word count: 56.3
• desc_3 average word count: 22.7


## 1.5 Final Dataset Checks and Saving

After preprocessing, it's crucial to ensure consistency and integrity across all datasets before moving on to the next phase. This step helps to avoid issues in later tasks such as modeling or analysis.

### Final Checks:

1. **Column Consistency**:
   - All datasets (embeddings, topic modeling, sentiment analysis) should have consistent columns, particularly for the processed description fields.
  
2. **Check for Missing Values**:
   - After preprocessing, there should be no missing values in the processed datasets. A final check will ensure that no rows contain missing values, which could disrupt further analysis.

3. **Row Count Consistency**:
   - Ensure that the number of rows remains the same across the different datasets to maintain consistency in analysis.
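
The three checks can be expressed as one small function. The sketch below uses lists of dictionaries so that it runs without dependencies; the name `check_datasets` and the sample data are assumptions, and it is not the implementation of `check_column_consistency` or `check_missing_values`. With Polars DataFrames, the equivalent inputs would come from `df.columns`, `df.height`, and `df.null_count()`.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def check_datasets(datasets: Dict[str, List[Row]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []

    # 1. Column consistency: every dataset should expose the same set of columns
    column_sets = {
        name: sorted({key for row in rows for key in row}) for name, rows in datasets.items()
    }
    if len({tuple(cols) for cols in column_sets.values()}) > 1:
        problems.append(f"column mismatch: {column_sets}")

    # 2. Missing values: count None entries per dataset
    for name, rows in datasets.items():
        missing = sum(1 for row in rows for value in row.values() if value is None)
        if missing:
            problems.append(f"{name}: {missing} missing value(s)")

    # 3. Row count consistency
    row_counts = {name: len(rows) for name, rows in datasets.items()}
    if len(set(row_counts.values())) > 1:
        problems.append(f"row count mismatch: {row_counts}")

    return problems


# Hypothetical miniature datasets
ok = {
    "embeddings": [{"rating": 92, "processed_desc_1": "crisply sweet ."}],
    "topic_modeling": [{"rating": 92, "processed_desc_1": "crisply sweet"}],
}
print(check_datasets(ok))
```

Collecting problems in a list instead of raising on the first failure lets a single run report every inconsistency at once.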

In [9]:
# Saving data
save_parquet(df_embeddings, output_paths["embeddings"], "Embeddings")
save_parquet(df_topic_modeling, output_paths["topic_modeling"], "Topic Modeling")
save_parquet(df_sentiment, output_paths["sentiment"], "Sentiment Analysis")

# Loading data for verification
df_embeddings = load_parquet(output_paths["embeddings"], "Embeddings")
df_topic_modeling = load_parquet(output_paths["topic_modeling"], "Topic Modeling")
df_sentiment = load_parquet(output_paths["sentiment"], "Sentiment Analysis")

# Checking column consistency
check_column_consistency(df_embeddings, df_topic_modeling, df_sentiment)

# Checking missing values
check_missing_values(df_embeddings, "Embeddings")
check_missing_values(df_topic_modeling, "Topic Modeling")
check_missing_values(df_sentiment, "Sentiment Analysis")

Embeddings data saved to /Users/seijas/Code/coffee-text-analytics/data/processed/embeddings.parquet
Topic Modeling data saved to /Users/seijas/Code/coffee-text-analytics/data/processed/topic_modeling.parquet
Sentiment Analysis data saved to /Users/seijas/Code/coffee-text-analytics/data/processed/sentiment.parquet
Embeddings DataFrame:


rating,roaster,roast,est_price,aroma,acid,body,flavor,aftertaste,desc_1,desc_2,desc_3,price_per_kg,country_of_origin,agtron_ground,agtron_whole,roast_by_agtron,processed_desc_1,processed_desc_2,processed_desc_3
i64,str,str,str,i64,i64,i64,i64,i64,str,str,str,f64,str,f64,f64,str,str,str,str
92,"""Temple Coffee and Tea""","""Medium-Light""","""$18.50/12 ounces""",9,8,8,9,8,"""High-toned, crisply sweet. Dat…","""Produced by Luis Alberto Monto…","""A vivacious, sweetly tart Colo…",54.38075,"""Colombia""",51.0,73.0,"""Medium""","""high-toned , crisply sweet . d…","""produced luis alberto montoya …","""vivacious , sweetly tart colom…"
92,"""Oceana Coffee""","""Medium-Light""","""$22.00/12 ounces""",9,8,8,9,8,"""Richly sweet, fruit-toned. Str…","""Produced by Danilo Salazar Ari…","""A fruit-forward Costa Rica cup…",64.669,"""Costa Rica""",54.0,75.0,"""Medium""","""richly sweet , fruit-toned . s…","""produced danilo salazar aria f…","""fruit-forward costa rica cup t…"
96,"""Dragonfly Coffee Roasters""","""Medium-Light""","""$75.00/8 ounces""",9,9,9,10,9,"""Rich-toned, cleanly fruit-forw…","""This exceptional coffee was se…","""A harmonious, refined, exquisi…",330.69375,"""Panama""",56.0,84.0,"""Medium-Light""","""rich-toned , cleanly fruit-for…","""exceptional coffee wa selected…","""harmonious , refined , exquisi…"
93,"""Caffeic""","""Medium-Light""","""$13.50/12 ounces""",9,9,8,9,8,"""Crisply sweet, fruit-toned. Ra…","""This coffee tied for the third…","""A sweetly fruit-forward natura…",39.68325,"""Ethiopia""",56.0,82.0,"""Medium-Light""","""crisply sweet , fruit-toned . …","""coffee tied third-highest rati…","""sweetly fruit-forward natural-…"
96,"""Dragonfly Coffee Roasters""","""Medium-Light""","""$45.00/8 ounces""",9,9,9,10,9,"""Crisp, elegantly sweet, rich-t…","""Produced from trees of the rar…","""Tropical fruit- and spice-tone…",198.41625,"""Panama""",56.0,76.0,"""Medium-Light""","""crisp , elegantly sweet , rich…","""produced tree rare ethiopia-de…","""tropical fruit- spice-toned ar…"


Topic Modeling DataFrame:


rating,roaster,roast,est_price,aroma,acid,body,flavor,aftertaste,desc_1,desc_2,desc_3,price_per_kg,country_of_origin,agtron_ground,agtron_whole,roast_by_agtron,processed_desc_1,processed_desc_2,processed_desc_3
i64,str,str,str,i64,i64,i64,i64,i64,str,str,str,f64,str,f64,f64,str,str,str,str
92,"""Temple Coffee and Tea""","""Medium-Light""","""$18.50/12 ounces""",9,8,8,9,8,"""High-toned, crisply sweet. Dat…","""Produced by Luis Alberto Monto…","""A vivacious, sweetly tart Colo…",54.38075,"""Colombia""",51.0,73.0,"""Medium""","""hightoned crisply sweet date…","""produced luis alberto montoya …","""vivacious sweetly tart colomb…"
92,"""Oceana Coffee""","""Medium-Light""","""$22.00/12 ounces""",9,8,8,9,8,"""Richly sweet, fruit-toned. Str…","""Produced by Danilo Salazar Ari…","""A fruit-forward Costa Rica cup…",64.669,"""Costa Rica""",54.0,75.0,"""Medium""","""richly sweet fruittoned stra…","""produced danilo salazar aria f…","""fruitforward costa rica cup th…"
96,"""Dragonfly Coffee Roasters""","""Medium-Light""","""$75.00/8 ounces""",9,9,9,10,9,"""Rich-toned, cleanly fruit-forw…","""This exceptional coffee was se…","""A harmonious, refined, exquisi…",330.69375,"""Panama""",56.0,84.0,"""Medium-Light""","""richtoned cleanly fruitforwar…","""exceptional coffee wa selected…","""harmonious refined exquisite…"
93,"""Caffeic""","""Medium-Light""","""$13.50/12 ounces""",9,9,8,9,8,"""Crisply sweet, fruit-toned. Ra…","""This coffee tied for the third…","""A sweetly fruit-forward natura…",39.68325,"""Ethiopia""",56.0,82.0,"""Medium-Light""","""crisply sweet fruittoned ras…","""coffee tied thirdhighest ratin…","""sweetly fruitforward naturalpr…"
96,"""Dragonfly Coffee Roasters""","""Medium-Light""","""$45.00/8 ounces""",9,9,9,10,9,"""Crisp, elegantly sweet, rich-t…","""Produced from trees of the rar…","""Tropical fruit- and spice-tone…",198.41625,"""Panama""",56.0,76.0,"""Medium-Light""","""crisp elegantly sweet richto…","""produced tree rare ethiopiader…","""tropical fruit spicetoned arom…"


Sentiment Analysis DataFrame:


rating,roaster,roast,est_price,aroma,acid,body,flavor,aftertaste,desc_1,desc_2,desc_3,price_per_kg,country_of_origin,agtron_ground,agtron_whole,roast_by_agtron,processed_desc_1,processed_desc_2,processed_desc_3
i64,str,str,str,i64,i64,i64,i64,i64,str,str,str,f64,str,f64,f64,str,str,str,str
92,"""Temple Coffee and Tea""","""Medium-Light""","""$18.50/12 ounces""",9,8,8,9,8,"""High-toned, crisply sweet. Dat…","""Produced by Luis Alberto Monto…","""A vivacious, sweetly tart Colo…",54.38075,"""Colombia""",51.0,73.0,"""Medium""","""high-toned , crisply sweet . d…","""produced luis alberto montoya …","""vivacious , sweetly tart colom…"
92,"""Oceana Coffee""","""Medium-Light""","""$22.00/12 ounces""",9,8,8,9,8,"""Richly sweet, fruit-toned. Str…","""Produced by Danilo Salazar Ari…","""A fruit-forward Costa Rica cup…",64.669,"""Costa Rica""",54.0,75.0,"""Medium""","""richly sweet , fruit-toned . s…","""produced danilo salazar aria f…","""fruit-forward costa rica cup t…"
96,"""Dragonfly Coffee Roasters""","""Medium-Light""","""$75.00/8 ounces""",9,9,9,10,9,"""Rich-toned, cleanly fruit-forw…","""This exceptional coffee was se…","""A harmonious, refined, exquisi…",330.69375,"""Panama""",56.0,84.0,"""Medium-Light""","""rich-toned , cleanly fruit-for…","""exceptional coffee wa selected…","""harmonious , refined , exquisi…"
93,"""Caffeic""","""Medium-Light""","""$13.50/12 ounces""",9,9,8,9,8,"""Crisply sweet, fruit-toned. Ra…","""This coffee tied for the third…","""A sweetly fruit-forward natura…",39.68325,"""Ethiopia""",56.0,82.0,"""Medium-Light""","""crisply sweet , fruit-toned . …","""coffee tied third-highest rati…","""sweetly fruit-forward natural-…"
96,"""Dragonfly Coffee Roasters""","""Medium-Light""","""$45.00/8 ounces""",9,9,9,10,9,"""Crisp, elegantly sweet, rich-t…","""Produced from trees of the rar…","""Tropical fruit- and spice-tone…",198.41625,"""Panama""",56.0,76.0,"""Medium-Light""","""crisp , elegantly sweet , rich…","""produced tree rare ethiopia-de…","""tropical fruit- spice-toned ar…"


All datasets have the same columns.

Missing values in Embeddings DataFrame:


rating,roaster,roast,est_price,aroma,acid,body,flavor,aftertaste,desc_1,desc_2,desc_3,price_per_kg,country_of_origin,agtron_ground,agtron_whole,roast_by_agtron,processed_desc_1,processed_desc_2,processed_desc_3
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0



Missing values in Topic Modeling DataFrame:


rating,roaster,roast,est_price,aroma,acid,body,flavor,aftertaste,desc_1,desc_2,desc_3,price_per_kg,country_of_origin,agtron_ground,agtron_whole,roast_by_agtron,processed_desc_1,processed_desc_2,processed_desc_3
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0



Missing values in Sentiment Analysis DataFrame:


rating,roaster,roast,est_price,aroma,acid,body,flavor,aftertaste,desc_1,desc_2,desc_3,price_per_kg,country_of_origin,agtron_ground,agtron_whole,roast_by_agtron,processed_desc_1,processed_desc_2,processed_desc_3
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


