# Modeling: Similarities between Portuguese attractions
The goal of this notebook is to find similarities between Portuguese attractions.

The necessary imports are provided below. We are working with the `cleaned_data.csv` file created during the Data Understanding and Data Preparation notebook.

In [9]:
import pandas as pd
from sklearn.metrics import pairwise_distances
from IPython.display import display

We decided not to subset the Portuguese rows by the attractions' IDs because we want to compare one of them with attractions from other countries.

In [10]:
df = pd.read_csv("cleaned_data.csv", delimiter=";")

#local_ids = ["MAG010", "MAG014", "MAG021", "MAG032", "MAG047", "MAG049", "MAG093"]
#df_filtered = df[df['localID'].isin(local_ids) ]

The `userName` field has the form Name + @ + Name. According to the dataset description: "The first is the public name of the user. The second is the TripAdvisor unique identifier of the user." We therefore keep only the part after the @, which is the unique identifier.

In [11]:
df['userName'] = df['userName'].apply(lambda x: x.split('@')[-1])
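
As a small illustration of this split, the sketch below applies the same idea to a few made-up user names (the values are hypothetical, not taken from the dataset). It uses the vectorized `.str` accessor instead of `apply`, which has the side benefit of leaving missing values missing rather than raising an error.

```python
import pandas as pd

# Hypothetical user names in the "public name@unique identifier" format
names = pd.Series(["Maria@maria_lx", "John D@jd1984", None])

# Keep only the part after the last '@'; missing values stay missing
user_ids = names.str.split("@").str[-1]
print(user_ids.tolist())
```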

We created an attraction-to-attraction similarity matrix from each user's average `reviewRating` per `localID`, using cosine similarity.

In [12]:
# Rows are users, columns are attractions; attractions a user did not rate get 0
customerProductMatrix = df.pivot_table(
    index='userName',
    columns='localID',
    values='reviewRating',
    aggfunc='mean'
).fillna(0)

# Cosine similarity between attraction columns is 1 minus the cosine distance
product_product_sim_matrix = pd.DataFrame(
    1 - pairwise_distances(customerProductMatrix.T, metric='cosine'),
    columns=customerProductMatrix.columns,
    index=customerProductMatrix.columns
)
display(product_product_sim_matrix)


localID,MAG001,MAG002,MAG004,MAG008,MAG009,MAG010,MAG011,MAG012,MAG014,MAG015,...,MAG087,MAG089,MAG093,MAG094,MAG095,MAG096,MAG097,MAG098,MAG099,MAG100
MAG001,1.0,0.323462,0.214904,0.034096,0.007725,0.012701,0.008023,0.023669,0.013251,0.020694,...,0.006831,0.01496,0.001045,0.0,0.000652,0.006013,0.01156,0.0,0.0,0.0
MAG002,0.323462,1.0,0.193076,0.024199,0.006984,0.010903,0.00268,0.023914,0.011455,0.016903,...,0.005023,0.013121,0.0,0.0,0.0,0.006077,0.016368,0.0,0.0,0.0
MAG004,0.214904,0.193076,1.0,0.018766,0.002857,0.005476,0.00258,0.020136,0.009374,0.017011,...,0.007159,0.020829,0.0,0.0,0.0,0.002949,0.012743,0.0,0.0,0.0
MAG008,0.034096,0.024199,0.018766,1.0,0.00277,0.011432,0.000625,0.074005,0.015122,0.125085,...,0.001281,0.033587,0.0,0.0,0.0,0.0,0.003218,0.029234,0.0,0.0
MAG009,0.007725,0.006984,0.002857,0.00277,1.0,0.005999,0.009267,0.002598,0.004937,0.002072,...,0.00285,0.0,0.003182,0.001129,0.006732,0.020146,0.0,0.0,0.0,0.0
MAG010,0.012701,0.010903,0.005476,0.011432,0.005999,1.0,0.00127,0.015299,0.278435,0.015124,...,0.002711,0.009207,0.00681,0.003581,0.000906,0.0,0.004084,0.021896,0.0,0.0
MAG011,0.008023,0.00268,0.00258,0.000625,0.009267,0.00127,1.0,0.002111,0.003197,0.0,...,0.00175,0.001829,0.001954,0.0,0.008043,0.002343,0.0,0.0,0.0,0.0
MAG012,0.023669,0.023914,0.020136,0.074005,0.002598,0.015299,0.002111,1.0,0.014708,0.093494,...,0.0,0.025845,0.0,0.0,0.001116,0.0,0.01569,0.011067,0.008401,0.0
MAG014,0.013251,0.011455,0.009374,0.015122,0.004937,0.278435,0.003197,0.014708,1.0,0.017178,...,0.00211,0.006615,0.010993,0.004645,0.003526,0.0,0.0,0.01794,0.0,0.0
MAG015,0.020694,0.016903,0.017011,0.125085,0.002072,0.015124,0.0,0.093494,0.017178,1.0,...,0.006652,0.038583,0.003789,0.0,0.0,0.0,0.006545,0.019851,0.0,0.0
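
To make the computation concrete, here is a minimal sketch on a tiny, made-up rating table (the user names, attraction IDs and ratings are all hypothetical). It builds the same kind of user-by-attraction matrix and checks that 1 minus the cosine distance agrees with the cosine similarity computed by hand with NumPy.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical reviews: three users, three attractions
toy = pd.DataFrame({
    "userName":     ["u1", "u1", "u2", "u2", "u3"],
    "localID":      ["A",  "B",  "A",  "B",  "C"],
    "reviewRating": [5,    4,    3,    3,    5],
})

# Rows are users, columns are attractions; unrated pairs become 0
toy_matrix = toy.pivot_table(
    index="userName", columns="localID", values="reviewRating", aggfunc="mean"
).fillna(0)

# Cosine similarity between attraction columns = 1 - cosine distance
toy_sim = pd.DataFrame(
    1 - pairwise_distances(toy_matrix.T, metric="cosine"),
    index=toy_matrix.columns,
    columns=toy_matrix.columns,
)

# Manual check for attractions A and B: dot product over product of norms
a, b = toy_matrix["A"].to_numpy(), toy_matrix["B"].to_numpy()
manual_ab = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(toy_sim.round(3))
print(round(manual_ab, 3))
```

Because unrated attractions are filled with 0, two attractions with no reviewers in common (A and C in this toy table) get a similarity of exactly 0, which is why so many entries in the matrix above are `0.0`.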


This is essentially the same computation as a user-to-user similarity matrix, applied to the transposed user-item matrix. We then print the 10 attractions most similar to `MAG047`.

In [21]:
top_10_similar_items = list(
    product_product_sim_matrix
        .loc['MAG047']
        .sort_values(ascending=False)
        .iloc[1:11]         # skip position 0, which is MAG047 itself (similarity 1)
        .index
)
print(top_10_similar_items)

['MAG032', 'MAG010', 'MAG014', 'MAG021', 'MAG049', 'MAG093', 'MAG052', 'MAG023', 'MAG059', 'MAG004']
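
The same lookup can be wrapped in a small helper so it works for any attraction. The following is only a sketch and not part of the original notebook: `top_n_similar` is a hypothetical name and the small matrix is made up for illustration. It removes the query attraction by label rather than by position, so it would still behave sensibly if another attraction happened to tie at similarity 1.

```python
import pandas as pd

def top_n_similar(sim_matrix: pd.DataFrame, item_id: str, n: int = 10) -> list:
    """Return the n items most similar to item_id, excluding item_id itself."""
    return (
        sim_matrix.loc[item_id]
        .drop(item_id)                  # remove the item itself by label
        .sort_values(ascending=False)
        .head(n)
        .index.tolist()
    )

# Made-up symmetric similarity matrix for three hypothetical attractions
ids = ["X1", "X2", "X3"]
toy_sim = pd.DataFrame(
    [[1.0, 0.2, 0.7],
     [0.2, 1.0, 0.1],
     [0.7, 0.1, 1.0]],
    index=ids, columns=ids,
)

print(top_n_similar(toy_sim, "X1", n=2))  # prints ['X3', 'X2']
```

With the notebook's matrix, the equivalent call would be `top_n_similar(product_product_sim_matrix, 'MAG047')`.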


We discuss these results in our report.