## RS Project
##### Valentin Tharaud - Luca Pedranzini


## About the dataset
X-Wines: A wine dataset for recommender systems and machine learning<br>
Rogério Xavier de Azambuja (rogerio.xavier@farroupilha.ifrs.edu.br)<br>
Dataset X-Wines from https://github.com/rogerioxavier/X-Wines<br><br>

Link to the data necessary to run the project: https://drive.google.com/drive/folders/1iC90-CMZOpFd3fJlpntbAl8J1VxfcWth?usp=sharing

In [1]:
# Basic libraries
import pandas as pd

# Display basic configs
pd.set_option('display.max_columns', None)  # show every column
pd.set_option('display.max_rows', 100)  # pandas default is 60
pd.options.mode.chained_assignment = None  # default='warn'

In [2]:
# Opening X-Wines dataset
wines = pd.read_csv("XWines_Slim_1K_wines.csv", low_memory=False, encoding="utf-8", memory_map=True)
len(wines)

1007

In [3]:
print("Total wines:", wines.WineID.nunique(), "from", wines.Code.nunique(), "different countries")

Total wines: 1007 from 31 different countries


In [5]:
wines.info(), wines.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1007 entries, 0 to 1006
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   WineID      1007 non-null   int64  
 1   WineName    1007 non-null   object 
 2   Type        1007 non-null   object 
 3   Elaborate   1007 non-null   object 
 4   Grapes      1007 non-null   object 
 5   Harmonize   1007 non-null   object 
 6   ABV         1007 non-null   float64
 7   Body        1007 non-null   object 
 8   Acidity     1007 non-null   object 
 9   Code        1007 non-null   object 
 10  Country     1007 non-null   object 
 11  RegionID    1007 non-null   int64  
 12  RegionName  1007 non-null   object 
 13  WineryID    1007 non-null   int64  
 14  WineryName  1007 non-null   object 
 15  Website     900 non-null    object 
 16  Vintages    1007 non-null   object 
dtypes: float64(1), int64(3), object(13)
memory usage: 133.9+ KB


(None,
 WineID          0
 WineName        0
 Type            0
 Elaborate       0
 Grapes          0
 Harmonize       0
 ABV             0
 Body            0
 Acidity         0
 Code            0
 Country         0
 RegionID        0
 RegionName      0
 WineryID        0
 WineryName      0
 Website       107
 Vintages        0
 dtype: int64)

In [6]:
wines.Type.value_counts()

Type
Red             506
White           232
Sparkling       100
Rosé             85
Dessert          59
Dessert/Port     25
Name: count, dtype: int64

In [7]:
wines.Body.value_counts()

Body
Full-bodied          392
Medium-bodied        348
Very full-bodied     134
Light-bodied         118
Very light-bodied     15
Name: count, dtype: int64

In [8]:
wines.Acidity.value_counts()

Acidity
High      740
Medium    249
Low        18
Name: count, dtype: int64

In [9]:
wines.Country.value_counts().sort_index()

Country
Argentina          50
Australia          48
Austria            18
Brazil             48
Canada             19
Chile              69
Croatia             1
Czech Republic      2
France            179
Germany            54
Greece             13
Hungary             8
Israel              1
Italy             139
Lebanon             4
Malta               5
Mexico              8
Moldova             1
New Zealand        33
Portugal           83
Romania             2
Russia              7
Slovenia            1
South Africa       32
Spain              48
Switzerland         4
Turkey              1
Ukraine             6
United Kingdom      1
United States     114
Uruguay             8
Name: count, dtype: int64

## Project Goals

The primary goal of this project is to develop a system that can identify wines similar to a given wine from the dataset.

### Approaches

1. **K-Nearest Neighbors (KNN) on Principal Component Analysis (PCA)**:
   - Reduce the dimensionality of the dataset using PCA.
   - Use KNN to find wines similar to the target wine based on the reduced dimensions.

2. **Cosine Similarity on Singular Value Decomposition (SVD)**:
   - Apply SVD to decompose the dataset.
   - Compute cosine similarity between wines based on the resulting components.

3. **Cosine Similarity with Neural Network Embeddings**:
   - Train a neural network to learn embeddings for the wines.
   - Use cosine similarity to find similar wines based on these learned embeddings.
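
The first two approaches can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the feature matrix `X` is a hypothetical toy example (the real features would come from encoded dataset columns), and the helper names `pca_project`, `svd_project`, and `nearest_by_cosine` are invented for this sketch. It uses NumPy only and assumes the neighbor search ranks by cosine similarity in both cases.

```python
import numpy as np

# Hypothetical toy feature matrix: 6 wines x 4 numeric features.
# These values are made up for illustration and are not taken from X-Wines.
X = np.array([
    [13.5, 3.0, 2.0, 1.0],
    [13.0, 3.0, 2.0, 1.0],
    [11.0, 1.0, 3.0, 0.0],
    [11.5, 1.0, 3.0, 0.0],
    [14.5, 4.0, 1.0, 1.0],
    [12.0, 2.0, 2.0, 0.0],
])

def pca_project(X, n_components):
    """PCA via SVD of the mean-centered data; returns the projected rows."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def svd_project(X, n_components):
    """Truncated SVD of the raw (uncentered) data; returns U_k * S_k."""
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def nearest_by_cosine(Z, idx, k):
    """Indices of the k rows of Z most cosine-similar to row idx (excluding idx)."""
    norms = np.linalg.norm(Z, axis=1)
    sims = (Z @ Z[idx]) / (norms * norms[idx] + 1e-12)  # epsilon avoids 0/0
    sims[idx] = -np.inf  # never return the query wine itself
    return np.argsort(-sims)[:k]

# Neighbors of wine 0 in each reduced space
print("PCA neighbors:", nearest_by_cosine(pca_project(X, 2), 0, 3))
print("SVD neighbors:", nearest_by_cosine(svd_project(X, 2), 0, 3))
```

The only difference between the two projections here is the mean-centering step, so any disagreement between the neighbor lists comes from how centering changes the angles between wines.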

### Expected Outcome

Both the KNN-on-PCA and the cosine-similarity-on-SVD approaches rank wines by cosine distance in a reduced linear space, and PCA is itself an SVD of the mean-centered data, so we expect the two to yield similar results. The neural network embeddings serve as a comparison point: we want to see whether learned representations improve similarity detection over the two linear approaches.
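
One simple way to check this expectation would be to measure how much the neighbor lists from two approaches overlap. The sketch below uses the Jaccard index for that purpose. The function name `neighbor_overlap` and the WineIDs are hypothetical, chosen only for illustration, and the metric itself is an assumption rather than the evaluation used in the project.

```python
def neighbor_overlap(a, b):
    """Jaccard overlap between two neighbor lists (1.0 means identical sets)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty lists agree trivially
    return len(sa & sb) / len(sa | sb)

# Hypothetical top-3 neighbor WineIDs returned by two approaches for one query wine
pca_neighbors = [101, 102, 103]
svd_neighbors = [102, 103, 104]

print(neighbor_overlap(pca_neighbors, svd_neighbors))
```

Averaging this score over many query wines would give a single number for how closely two approaches agree, although it ignores the order in which neighbors are ranked.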