In this notebook, we clean our raw dataset `banks_raw.xlsx`. We use the `pandas` library for this task.

In [85]:
# import libraries
import numpy as np
import pandas as pd

In [86]:
# load the data
df = pd.read_excel("../data/raw/banks_raw.xlsx")

In [87]:
df.columns

Index(['address', 'categoryName', 'cid', 'city', 'location/lat',
       'location/lng', 'phone', 'rank', 'reviewsCount', 'title', 'totalScore',
       'reviewId', 'publishedAtDate', 'stars', 'text', 'textTranslated',
       'reviewerId', 'reviewerNumberOfReviews', 'name'],
      dtype='object')

We need to rename some columns and check the data types.

In [88]:
df.columns = ['bank_adress', 'bank_category', 'bank_id', 'bank_city', 'latitude', 'longitude', 'bank_phone', 'bank_rank', 'reviews_count', 'bank_title', 'bank_score', 'review_id', 'review_date', 'review_stars', 'review_text', 'review_text_translated', 'reviewer_id', 'reviewer_num_reviews', 'reviewer_name']
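Assigning a full list to `df.columns` relies on the raw file keeping the same column order. As a minimal sketch, the same renaming can be written with an explicit mapping that raises an error if a source column is missing. The toy frame and the `COLUMN_MAP` name are illustrative assumptions, and only a few of the columns are shown:

```python
import pandas as pd

# Illustrative subset of the renaming used above (raw name -> clean name).
COLUMN_MAP = {
    "address": "bank_adress",
    "categoryName": "bank_category",
    "cid": "bank_id",
    "location/lat": "latitude",
    "location/lng": "longitude",
}

# Toy frame standing in for the raw data; the column order differs on purpose.
raw = pd.DataFrame(
    {
        "cid": [1],
        "address": ["Rue A"],
        "location/lng": [-7.6],
        "location/lat": [33.5],
        "categoryName": ["Banque"],
    }
)

# errors="raise" makes rename fail with a KeyError if a mapped column is absent.
renamed = raw.rename(columns=COLUMN_MAP, errors="raise")
print(list(renamed.columns))
```

Because the mapping is keyed by name, the result does not depend on the order of the columns in the source file.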

In [89]:
df.head()

Unnamed: 0,bank_adress,bank_category,bank_id,bank_city,latitude,longitude,bank_phone,bank_rank,reviews_count,bank_title,bank_score,review_id,review_date,review_stars,review_text,review_text_translated,reviewer_id,reviewer_num_reviews,reviewer_name
0,"Rue Abou Abdellah Nafii, Casablanca 20250, Maroc",Banque,608978182545380096,Casablanca,33.585603,-7.633682,,20,8,"Bank Of Africa, Agence Al Massira, Maarif.",3.1,,2023-02-27T15:53:13.594Z,5.0,Une très belle équipe .\nJe recommande spécial...,,,12.0,
1,"Rue Abou Abdellah Nafii, Casablanca 20250, Maroc",Banque,608978182545380096,Casablanca,33.585603,-7.633682,,20,8,"Bank Of Africa, Agence Al Massira, Maarif.",3.1,,2022-09-28T09:38:14.263Z,5.0,,,,0.0,
2,"Rue Abou Abdellah Nafii, Casablanca 20250, Maroc",Banque,608978182545380096,Casablanca,33.585603,-7.633682,,20,8,"Bank Of Africa, Agence Al Massira, Maarif.",3.1,,2022-05-27T18:42:53.897Z,2.0,,,,49.0,
3,"Rue Abou Abdellah Nafii, Casablanca 20250, Maroc",Banque,608978182545380096,Casablanca,33.585603,-7.633682,,20,8,"Bank Of Africa, Agence Al Massira, Maarif.",3.1,,2022-04-08T13:10:33.584Z,1.0,,,,2.0,
4,"Rue Abou Abdellah Nafii, Casablanca 20250, Maroc",Banque,608978182545380096,Casablanca,33.585603,-7.633682,,20,8,"Bank Of Africa, Agence Al Massira, Maarif.",3.1,,2022-01-08T15:36:51.089Z,1.0,Mauvais service client,,,2.0,


In [90]:
# drop duplicate reviews
df = df.drop_duplicates()

In [91]:
df['bank_city'].unique()

array(['Casablanca', 'Rabat', 'Rabat Hassan', 'Fès', 'Marrakech',
       'Tanger', 'Al Hoceïma', 'Meknès', 'Safi', 'Nador', 'Kénitra',
       'El Jadida', 'Tétouan', 'Khouribga', 'Zagora', 'Dakhla',
       'Laayoune', 'Laâyoune', 'Settat', 'Guisser', 'Ouarzazate',
       'Guelmim', 'Berrchid', 'Sidi Ifni', 'Berkane', 'Sefrou',
       'Taounate', 'Khémisset', 'Essaouira', 'Boujdour'], dtype=object)

Some cities are spelled in more than one way (for example `Laayoune` and `Laâyoune`), so we merge these variants.

In [92]:
df.loc[df['bank_city'] == 'Rabat Hassan', 'bank_city'] = 'Rabat'
df.loc[df['bank_city'] == 'Laayoune', 'bank_city'] = 'Laâyoune'
df['bank_city'].unique()

array(['Casablanca', 'Rabat', 'Fès', 'Marrakech', 'Tanger', 'Al Hoceïma',
       'Meknès', 'Safi', 'Nador', 'Kénitra', 'El Jadida', 'Tétouan',
       'Khouribga', 'Zagora', 'Dakhla', 'Laâyoune', 'Settat', 'Guisser',
       'Ouarzazate', 'Guelmim', 'Berrchid', 'Sidi Ifni', 'Berkane',
       'Sefrou', 'Taounate', 'Khémisset', 'Essaouira', 'Boujdour'],
      dtype=object)

Cities are now clean.
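If more spelling variants turn up later, the two `.loc` assignments can be collected into a single mapping. A small sketch on a toy series, where the `CITY_FIXES` name is an assumption introduced for illustration:

```python
import pandas as pd

# Variant spelling -> canonical spelling, taken from the values listed above.
CITY_FIXES = {"Rabat Hassan": "Rabat", "Laayoune": "Laâyoune"}

cities = pd.Series(["Rabat", "Rabat Hassan", "Laayoune", "Laâyoune", "Fès"])

# replace() swaps exact matches and leaves every other value untouched.
cleaned = cities.replace(CITY_FIXES)
print(cleaned.unique())
```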

In [93]:
df['bank_category'].unique()

array(['Banque', "Service de transfert d'argent",
       "Service de transfert d'argent par mandat", "Banque d'épargne",
       'Poste', 'Institution financière', 'Établissement de crédit',
       'Magasin de téléphonie mobile', 'Conseiller financier',
       'Bibliothèque municipale', 'Bureau de change',
       'Distributeur de billets', nan, 'Attraction touristique',
       'Pizzeria', 'Dépôt-vente'], dtype=object)

Some of these categories are not banks (for example `Pizzeria` or `Poste`), so we keep only the bank-related categories and the rows with a missing category.

In [94]:
df = df.loc[df['bank_category'].isin(['Banque', "Banque d'épargne", 'Institution financière', 'Établissement de crédit', np.nan])]
df['bank_category'].unique()

array(['Banque', "Banque d'épargne", 'Institution financière',
       'Établissement de crédit', nan], dtype=object)
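The filter above keeps missing categories by passing `np.nan` to `isin`. A sketch of an equivalent filter that states this intent explicitly with `isna()`; the `BANK_CATEGORIES` name and the toy frame are assumptions for illustration:

```python
import pandas as pd

# Categories treated as banks in the cell above.
BANK_CATEGORIES = [
    "Banque",
    "Banque d'épargne",
    "Institution financière",
    "Établissement de crédit",
]

sample = pd.DataFrame({"bank_category": ["Banque", "Pizzeria", None, "Poste"]})

# Keep rows whose category is allowed OR missing.
keep = sample["bank_category"].isin(BANK_CATEGORIES) | sample["bank_category"].isna()
filtered = sample.loc[keep]
print(filtered)
```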

In [95]:
df['bank_title'].unique()

array(['Bank Of Africa, Agence Al Massira, Maarif.', 'Citibank',
       'Al Akhdar Bank agence casa Racine',
       'BANK OF AFRICA, Agence Abdelmoumen, Casablanca.',
       'Attijariwafa Bank', 'Bank Assafa - Agence PALMIER',
       'BANK OF AFRICA', 'BANK OF AFRICA Golf', 'Bank chaabi',
       'Bank of Africa', 'AttijariWafa Bank Blaise Pascal',
       'Bank of Africa la Gironde', 'ATTIJARIWAFA BANK Sidi Abderrahmane',
       'Attijariwafa Bank, Agence Rue De Rome, Mers Sultan.',
       'Banque Attijariwafa Bank', 'Al Barid bank',
       'BANK OF AFRICA RABAT HAY RIYAD',
       'Attijariwafa Bank Rabat Abd. AlKhattabi', 'BMCE Bank',
       'Bank Al Yousr', 'CFG Bank - Agence Rabat Souissi',
       'banque populaire', 'Barid Bank', 'Umnia Bank Rabat Al Manal',
       'AL AKHDAR BANK HAY RIAD', 'Attijariwafa bank', 'Banque Al yousr',
       'BMCE BANK Agence Rabat Centrale', 'Tijariwafa Bank',
       'Umnia Bank (ag. Rabat hay Ryad)', 'BANK OF AFRICA RABAT CENTRALE',
       'Banque Pop

Here, we see that many banks appear under the same name written in different ways, which we need to fix. We are going to use the `fuzzywuzzy` library to help us with that.  
Labels that do not refer to a bank are mapped to `unknown` and removed.

In [96]:
# import the function from utils/functions.py
import sys
sys.path.append('../utils')
import functions

# define a list of bank labels
labels = df['bank_title'].unique()

# create a dictionary that maps each bank label to its corresponding bank
bank_dict = functions.map_labels_to_banks(labels)

In [97]:
# replace each label in bank_title with its mapped bank name
df['bank_title'] = df['bank_title'].apply(lambda x: bank_dict[x])

# drop the rows that contain unknown banks
df = df.loc[df['bank_title'] != 'unknown']
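The body of `functions.map_labels_to_banks` is not shown in this notebook. The following is only a rough sketch of how such a mapping could work. It uses `difflib` from the standard library rather than `fuzzywuzzy`, and the canonical list, the function name `map_labels_to_banks_sketch`, and the `cutoff` value are all hypothetical:

```python
import difflib

# Hypothetical canonical names; the real list lives in utils/functions.py.
CANONICAL_BANKS = [
    "attijariwafa bank",
    "bank of africa",
    "banque populaire",
    "al barid bank",
    "cfg bank",
    "umnia bank",
]


def map_labels_to_banks_sketch(labels, canonical=CANONICAL_BANKS, cutoff=0.8):
    """Map each raw label to a canonical bank name, or 'unknown'."""
    mapping = {}
    for label in labels:
        cleaned = str(label).lower().strip()
        # 1) Cheap check: the canonical name appears inside the label.
        hit = next((name for name in canonical if name in cleaned), None)
        # 2) Otherwise fall back to approximate string matching.
        if hit is None:
            close = difflib.get_close_matches(cleaned, canonical, n=1, cutoff=cutoff)
            hit = close[0] if close else "unknown"
        mapping[label] = hit
    return mapping


mapping = map_labels_to_banks_sketch(
    ["BANK OF AFRICA Golf", "Tijariwafa Bank", "Pizzeria"]
)
print(mapping)
```

A stricter `cutoff` reduces false matches at the cost of sending more labels to `unknown`, so the threshold is worth checking against the actual labels.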

Next, we look at the other bank-related columns.

In [98]:
df['bank_phone'].unique()

array([nan, '+212 5224-89600', '+212 5229-99900', '+212 5222-95955',
       '+212 5290-25980', '+212 5229-53570', '+212 5226-39550',
       '+212 5223-04466', '+212 5222-82285', '+212 5377-13575',
       '+212 5224-98004', '+212 5377-14519', '+212 5377-02296',
       '+212 5375-69691', '+212 5380-12910', '+212 5375-76995',
       '+212 5227-76800', '+212 5226-46264', '+212 530-855820',
       '+212 5377-78151', '+212 5372-16152', '+212 5375-75970',
       '+212 5372-31291', '+212 5359-30078', '+212 611-618255',
       '+212 5359-62480', '+212 5357-24090', '+212 5357-37840',
       '+212 5359-40183', '+212 5356-44122', '+212 5224-24243',
       '+212 80-2000047', '+212 5224-24242', '+212 5244-30320',
       '+212 5244-37265', '+212 52980-0755', '+212 5243-14693',
       '+212 5244-18845', '+212 5244-37326', '+212 5244-46025',
       '+212 5243-39180', '+212 5393-28340', '+212 5399-37424',
       '+212 5393-49120', '+212 5399-57649', '+212 5393-28280',
       '+212 5399-32282', '+212 538

In [99]:
df['bank_rank'].unique()

array([20, 19, 18, 17, 16, 14, 15, 13, 12, 11, 10,  9,  8,  7,  6,  5,  3,
        4,  2,  1])

In [100]:
df['reviews_count'].unique()

array([ 8, 14,  2, 10,  1,  3,  0,  7,  6,  5,  4, 18, 12,  9, 13, 28, 22,
       11, 15, 27, 34, 20, 32, 21, 24, 33, 16, 78, 31, 51, 60, 23, 26, 17])

Now, all the bank-related fields are tidy.  
We need to do the same for the columns related to reviews and reviewers.  
For the review text, we want a single column in French: we use the translated text when it exists and fall back to the original text otherwise.

In [101]:
# create a text column which takes text_translated if it exists, otherwise it takes text
df['review_text'] = df['review_text_translated'].fillna(df['review_text'])
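To make the fallback concrete, here is the same `fillna` pattern on a toy frame with made-up reviews: one already in French, one translated, and one with no text at all.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "review_text": ["Mauvais service client", "Good service", None],
        "review_text_translated": [None, "Bon service", None],
    }
)

# Prefer the translation; fall back to the original text where it is missing.
sample["review_text"] = sample["review_text_translated"].fillna(sample["review_text"])
print(sample["review_text"].tolist())
```

Rows where both columns are empty stay missing, which matches the star-only reviews seen in `df.head()`.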

In [102]:
# changing reviewer names to unknown if they are nan
df['reviewer_name'] = df['reviewer_name'].fillna('unknown')

Now we can drop the columns we no longer need.

In [103]:
df.drop(['review_text_translated', 'review_id', 'reviewer_id'], axis=1, inplace=True)

In [104]:
df['review_id'] = df.index
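Since rows were removed by `drop_duplicates()` and by the filters above, the index now has gaps. The values are still unique, so they work as identifiers. If contiguous identifiers are preferred, one option (a sketch on a toy frame, not what the notebook does) is to reset the index first:

```python
import pandas as pd

# Toy frame whose index has gaps, as after filtering rows.
sample = pd.DataFrame({"review_stars": [5.0, 2.0, 1.0]}, index=[0, 3, 7])

# drop=True discards the old index instead of keeping it as a column.
sample = sample.reset_index(drop=True)
sample["review_id"] = sample.index
print(sample)
```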

In [105]:
df.dtypes

bank_adress              object
bank_category            object
bank_id                  uint64
bank_city                object
latitude                float64
longitude               float64
bank_phone               object
bank_rank                 int64
reviews_count             int64
bank_title               object
bank_score              float64
review_date              object
review_stars            float64
review_text              object
reviewer_num_reviews    float64
reviewer_name            object
review_id                 int64
dtype: object
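The `review_date` column is still stored as `object`, that is, as ISO 8601 strings such as `2023-02-27T15:53:13.594Z`. A possible follow-up, shown here on two values taken from `df.head()`, is to parse it into timezone-aware timestamps:

```python
import pandas as pd

dates = pd.Series(["2023-02-27T15:53:13.594Z", "2022-09-28T09:38:14.263Z"])

# utc=True yields timezone-aware timestamps; the trailing "Z" already means UTC.
parsed = pd.to_datetime(dates, utc=True)
print(parsed.dt.year.tolist())
```

Once parsed, the column supports the `.dt` accessor for extracting years, months, or weekdays.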

Before saving the data, we reorder the columns so that the bank fields come first, followed by the review and reviewer fields.

In [106]:
# reorder the columns
df = df[['bank_id', 'bank_title', 'bank_category', 'bank_adress', 'bank_city', 'bank_phone', 'bank_rank', 'reviews_count', 'bank_score', 'latitude', 'longitude', 'review_id', 'review_date', 'review_stars', 'review_text', 'reviewer_name', 'reviewer_num_reviews']]

To make building the data warehouse easier, we split this dataset into separate tables: one for banks and one for reviews, with the reviewer fields kept alongside the reviews.

In [107]:
banks_df = df[['bank_id', 'bank_title', 'bank_category', 'bank_adress', 'bank_city', 'bank_phone', 'bank_rank', 'reviews_count', 'bank_score', 'latitude', 'longitude']]
reviews_df = df[['review_id', 'bank_id', 'review_date', 'review_stars', 'review_text', 'reviewer_name', 'reviewer_num_reviews']]

In [108]:
banks_df = banks_df.drop_duplicates()
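`drop_duplicates()` only removes rows that are identical in every column, so a bank whose attributes differ between rows would still appear twice. A quick sanity check, sketched on a toy frame, is to confirm that `bank_id` is unique before using it as a key:

```python
import pandas as pd

banks_sample = pd.DataFrame(
    {"bank_id": [1, 1, 2], "bank_title": ["A", "A", "B"]}
).drop_duplicates()

# IDs that still occur more than once after de-duplication (ideally none).
dup_ids = banks_sample.loc[
    banks_sample["bank_id"].duplicated(keep=False), "bank_id"
].unique()
print(len(dup_ids), banks_sample["bank_id"].is_unique)
```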

In [109]:
banks_df.to_excel("../data/processed/banks.xlsx", index=False)
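The cell above writes only `banks_df`. The reviews table could be saved in a similar way. The sketch below writes a toy frame as CSV to an in-memory buffer, which stands in for a real output path; any such path would be an assumption, since the notebook does not name one:

```python
import io

import pandas as pd

reviews_sample = pd.DataFrame(
    {"review_id": [0], "bank_id": [608978182545380096], "review_stars": [5.0]}
)

# StringIO stands in for a file path (hypothetical; none is given in the notebook).
buffer = io.StringIO()
reviews_sample.to_csv(buffer, index=False)
print(buffer.getvalue())
```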

Now we have a clean dataset, with banks and reviews in separate dataframes.  
The next step is to perform sentiment analysis on the reviews, after which we can build our data warehouse.  
For this, we are going to create a new column called `sentiment`, which will hold the sentiment of each review.