# <div style="text-align: center; background-color:#67001f; font-family:monospace; color: white; padding: 14px; line-height: 1;border-radius:20px">🕮 NLP Text Clustering & Web Scraping Project</div>

## <div style="text-align: left;background-color:#371de3; font-family:monospace; color: white; padding: 14px; line-height: 1; border-radius:10px"> Project Architecture 🏗️ </div>

![architecture Data](asset/architecture_data.png)

## <div style="text-align: left;background-color:#371de3; font-family:monospace; color: white; padding: 14px; line-height: 1; border-radius:10px"> Objectives 🎯 </div>

1. Prepare the data for **text clustering**.
2. Train a **clustering** model with **KMeans**.
3. Label the data and store the results in the database.
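
The three objectives above can be sketched end to end as follows. This is a minimal illustration rather than the project's actual pipeline: the sample rows, the `description` column name, the TF-IDF settings, and `n_clusters=2` are all assumptions made for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample; the real project works on the scraped book data.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "description": [
        "a detective investigates a murder in london",
        "a murder mystery solved by a clever detective",
        "recipes for baking bread and cakes at home",
        "a cookbook of simple bread and cake recipes",
    ],
})

# 1. Prepare the text: turn each description into a TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(books["description"])

# 2. Train KMeans (n_clusters=2 is an assumption for this toy sample).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)

# 3. Label each book with its cluster id; this labeled frame is
#    what would then be written to the database.
books["cluster"] = kmeans.fit_predict(X)
print(books[["title", "cluster"]])
```

On real data the number of clusters would have to be chosen separately, for example by comparing inertia or silhouette scores across several values of `n_clusters`.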

##  <div style="text-align: left;background-color:#371de3; font-family:monospace; color: white; padding: 14px; line-height: 1; border-radius:10px"> Results 📊</div>

- **Database**: The data is collected from the website, cleaned, structured, and stored in Snowflake.
- **Clustering**: Books are grouped into clusters based on textual features.
- **API exposure**:
  - Endpoint: `/clusters` → Returns the book clusters.
  - Endpoint: `/predict` → Predicts the cluster of a given book.
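
The logic behind the two endpoints could look roughly like the sketch below. The notebook does not say which web framework serves the API, so only the handler logic is shown as plain functions. The names `get_clusters` and `predict_cluster`, the in-memory sample data, and the response shapes are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical in-memory catalogue; the project stores its data in Snowflake.
TITLES = ["Book A", "Book B", "Book C", "Book D"]
DESCRIPTIONS = [
    "a detective investigates a murder in london",
    "a murder mystery solved by a clever detective",
    "recipes for baking bread and cakes at home",
    "a cookbook of simple bread and cake recipes",
]

# Fit the vectorizer and the model once, when the service starts.
vectorizer = TfidfVectorizer(stop_words="english")
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(vectorizer.fit_transform(DESCRIPTIONS))


def get_clusters():
    """Logic behind /clusters: map each cluster id to its book titles."""
    clusters = {}
    for title, label in zip(TITLES, labels):
        clusters.setdefault(int(label), []).append(title)
    return clusters


def predict_cluster(description):
    """Logic behind /predict: assign a new description to the nearest centroid."""
    vector = vectorizer.transform([description])
    return {"cluster": int(model.predict(vector)[0])}


print(get_clusters())
print(predict_cluster("a detective and a murder"))
```

Reusing the fitted `vectorizer` inside `predict_cluster` matters: a new description must be projected into the same TF-IDF vocabulary the model was trained on.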


# <div style="text-align: center; background-color:#b2182b; font-family:newtimeroman; color: white; padding: 14px; line-height: 1;border-radius:20px">0. Import Necessary Libraries</div>

In [1]:
# dataframe
import pandas as pd
import numpy as np
from uuid import uuid4

# data visualization
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import plot
from wordcloud import WordCloud
from PIL import Image

# requests, web scraping
import requests
from bs4 import BeautifulSoup, NavigableString, Tag
import re

# tensorflow, for NN
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, GlobalMaxPooling1D, Dropout


# NLTK, NLP Libraries
import nltk
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')

# sklearn, for preprocessing & scoring
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\vital.guinguinni\AppData\Roaming\nltk_data...


# <div style="text-align: center; background-color:#b2182b; font-family:newtimeroman; color: white; padding: 14px; line-height: 1;border-radius:20px">2. Univariate Analysis</div>

In [None]:
Book_Data = pd.read_csv(r"\books_dataframe_cleaned")

## <div style="text-align: left; background-color: #a33939; font-family:newtimeroman; color: white; padding: 14px; line-height: 1;border-radius:15px"> 2.1. Explore each Numeric Columns </div>

In [None]:
# description of the numeric columns
describe=Book_Data.describe(include =['float', 'int'])
describe.T.style.background_gradient(low=0.2,high=0.5,cmap = 'rocket_r')

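| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| price | 1000.0 | 35.07035 | 14.44669 | 10.0 | 22.1075 | 35.98 | 47.4575 | 59.99 |
| nb_in_stock | 1000.0 | 8.585 | 5.654622 | 1.0 | 3.0 | 7.0 | 14.0 | 22.0 |
| rating | 1000.0 | 2.923 | 1.434967 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |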


## <div style="text-align: left; background-color:#d6604d; font-family:newtimeroman; color: white; padding: 8px; line-height: 1;border-radius:5px"> 2.1.1. Rating Distribution</div>

In [None]:
# Value Count
rating_df = pd.DataFrame(Book_Data["rating"].value_counts().sort_index()).reset_index()

# 1. Rating
fig1 = make_subplots(rows=1, cols=2, specs=[[{"type": "pie"}, {"type": "pie"}]])
fig1.add_trace(go.Pie(values=rating_df['count'],
             labels=rating_df['rating'], 
             marker=dict(colors=['#fddbc7','#f4a582','#d6604d','#b2182b','#67001f']),
             # `titlefont` is deprecated; set the font through the title dict instead
             title=dict(text='Rating', font=dict(size=17))),row=1,col=1)


# 2. bar plot
fig2 = px.bar(x = rating_df['rating'], 
              y = rating_df['count'], 
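              # cast to str so the ratings are treated as categories and the discrete palette applies
              text = rating_df['count'], color = rating_df['rating'].astype(str),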
              color_discrete_sequence = px.colors.sequential.RdBu,
              template = "simple_white",
              title = 'Rating Bar Plot')

fig2.update_layout(
    xaxis_title="Rating",
    yaxis_title="count",
    font=dict(size=17,family="Franklin Gothic"))

fig1.show()
fig2.show()

In [None]:
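# Assumes Book_Data has 'Recommended IND' and 'Text_Length' columns;
# neither appears in the describe() output above.
recommended = Book_Data[Book_Data['Recommended IND'] == 1]
recommended_n = Book_Data[Book_Data['Recommended IND'] == 0]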

hist_data = [recommended['Text_Length'], recommended_n['Text_Length']]
group_labels = ['Text of Recommended Comments','Text of Unrecommended Comments']

fig = ff.create_distplot(hist_data, group_labels,show_hist = False, colors=['#2166ac','#b2182b'])
fig.update_layout(title = 'Text Length by Recommended IND',
                  font = dict(size=17, family = 'Franklin Gothic'),template = "simple_white") 
fig.show()

In [None]:
# 1. distribution plot
fig.show()
# 2. bar plot
fig2.show()

In [None]:
## Text Clustering

## <div style="text-align: left; background-color: #a33939; font-family:newtimeroman; color: white; padding: 14px; line-height: 1;border-radius:15px"> 5.2. Train-Test Split </div>

## <div style="text-align: left; background-color: #a33939; font-family:newtimeroman; color: white; padding: 14px; line-height: 1;border-radius:15px"> 5.3. Tokenization, Sequencing and Padding</div>

## <div style="text-align: left; background-color: #a33939; font-family:newtimeroman; color: white; padding: 14px; line-height: 1;border-radius:15px"> 5.4. GloVe Embedding </div>

# <div style="text-align: center; background-color:#b2182b; font-family:newtimeroman; color: white; padding: 14px; line-height: 1;border-radius:20px">3. Bivariate Analysis</div>

# <div style="text-align: center; background-color:#b2182b; font-family:newtimeroman; color: white; padding: 14px; line-height: 1;border-radius:20px">4. Text Clustering</div>

# <div style="text-align: center; background-color:#b2182b; font-family:newtimeroman; color: white; padding: 14px; line-height: 1;border-radius:20px"> 5. Text Classification </div>

## <div style="text-align: left; background-color: #a33939; font-family:newtimeroman; color: white; padding: 14px; line-height: 1;border-radius:15px"> 6.1. Define & Train Model </div>

## <div style="text-align: left; background-color: #a33939; font-family:newtimeroman; color: white; padding: 14px; line-height: 1;border-radius:15px"> 6.2. Model Evaluation</div>


## <div style="text-align: left; background-color:#d6604d; font-family:newtimeroman; color: white; padding: 8px; line-height: 1;border-radius:5px"> 6.2.1. Compare AUC & Loss Score</div>

## <div style="text-align: left; background-color:#d6604d; font-family:newtimeroman; color: white; padding: 8px; line-height: 1;border-radius:5px"> 6.2.2. Compare each Scores </div>
