In this notebook, we clean our raw dataset `banks_raw.xlsx`, using the `pandas` library for the task.

In [51]:
# import libraries
import numpy as np
import pandas as pd

In [52]:
# load the data
df = pd.read_excel("../data/raw/banks_raw.xlsx")

In [53]:
df.columns

Index(['address', 'categoryName', 'cid', 'city', 'location/lat',
       'location/lng', 'phone', 'rank', 'reviewsCount', 'title', 'totalScore',
       'reviewId', 'publishedAtDate', 'stars', 'text', 'textTranslated',
       'reviewerId', 'reviewerNumberOfReviews', 'name'],
      dtype='object')

We need to rename some columns and check the data types.

In [54]:
df.columns = ['bank_adress', 'bank_category', 'bank_id', 'bank_city', 'bank_latitude', 'bank_longitude', 'bank_phone', 'bank_rank', 'bank_reviews_count', 'bank_title', 'bank_score', 'review_id', 'review_date', 'review_stars', 'review_text', 'review_text_translated', 'reviewer_id', 'reviewer_num_reviews', 'reviewer_name']
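Assigning a list to `df.columns` relies on the columns arriving in exactly the expected order. As a minimal sketch of an alternative, an explicit old-to-new mapping passed to `rename` does not depend on order. The small `raw` frame below is a hypothetical subset used only for illustration.

```python
import pandas as pd

# Hypothetical subset of the raw columns, for illustration only
raw = pd.DataFrame({
    "title": ["Citibank"],
    "address": ["Casablanca"],
    "location/lat": [33.58],
})

# Explicit old -> new mapping; the column order in the file no longer matters
rename_map = {
    "address": "bank_adress",
    "location/lat": "bank_latitude",
    "title": "bank_title",
}

# Fail early if the export is missing an expected column
missing = set(rename_map) - set(raw.columns)
assert not missing, f"missing columns: {missing}"

renamed = raw.rename(columns=rename_map)
print(list(renamed.columns))
```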

In [55]:
df.head()

Unnamed: 0,bank_adress,bank_category,bank_id,bank_city,bank_latitude,bank_longitude,bank_phone,bank_rank,bank_reviews_count,bank_title,bank_score,review_id,review_date,review_stars,review_text,review_text_translated,reviewer_id,reviewer_num_reviews,reviewer_name
0,"Rue Abou Abdellah Nafii, Casablanca 20250, Maroc",Banque,608978182545380096,Casablanca,33.585603,-7.633682,,20,8,"Bank Of Africa, Agence Al Massira, Maarif.",3.1,,2023-02-27T15:53:13.594Z,5.0,Une très belle équipe .\nJe recommande spécial...,,,12.0,
1,"Rue Abou Abdellah Nafii, Casablanca 20250, Maroc",Banque,608978182545380096,Casablanca,33.585603,-7.633682,,20,8,"Bank Of Africa, Agence Al Massira, Maarif.",3.1,,2022-09-28T09:38:14.263Z,5.0,,,,0.0,
2,"Rue Abou Abdellah Nafii, Casablanca 20250, Maroc",Banque,608978182545380096,Casablanca,33.585603,-7.633682,,20,8,"Bank Of Africa, Agence Al Massira, Maarif.",3.1,,2022-05-27T18:42:53.897Z,2.0,,,,49.0,
3,"Rue Abou Abdellah Nafii, Casablanca 20250, Maroc",Banque,608978182545380096,Casablanca,33.585603,-7.633682,,20,8,"Bank Of Africa, Agence Al Massira, Maarif.",3.1,,2022-04-08T13:10:33.584Z,1.0,,,,2.0,
4,"Rue Abou Abdellah Nafii, Casablanca 20250, Maroc",Banque,608978182545380096,Casablanca,33.585603,-7.633682,,20,8,"Bank Of Africa, Agence Al Massira, Maarif.",3.1,,2022-01-08T15:36:51.089Z,1.0,Mauvais service client,,,2.0,


In [56]:
# drop duplicate reviews
df = df.drop_duplicates()

In [57]:
df['bank_city'].unique()

array(['Casablanca', 'Rabat', 'Rabat Hassan', 'Fès', 'Marrakech',
       'Tanger', 'Al Hoceïma', 'Meknès', 'Safi', 'Nador', 'Kénitra',
       'El Jadida', 'Tétouan', 'Khouribga', 'Zagora', 'Dakhla',
       'Laayoune', 'Laâyoune', 'Settat', 'Guisser', 'Ouarzazate',
       'Guelmim', 'Berrchid', 'Sidi Ifni', 'Berkane', 'Sefrou',
       'Taounate', 'Khémisset', 'Essaouira', 'Boujdour'], dtype=object)

Some cities are written in more than one way (`Laayoune` and `Laâyoune`, `Rabat` and `Rabat Hassan`), so we need to normalize them.

In [58]:
df.loc[df['bank_city'] == 'Rabat Hassan', 'bank_city'] = 'Rabat'
df.loc[df['bank_city'] == 'Laayoune', 'bank_city'] = 'Laâyoune'
df['bank_city'].unique()

array(['Casablanca', 'Rabat', 'Fès', 'Marrakech', 'Tanger', 'Al Hoceïma',
       'Meknès', 'Safi', 'Nador', 'Kénitra', 'El Jadida', 'Tétouan',
       'Khouribga', 'Zagora', 'Dakhla', 'Laâyoune', 'Settat', 'Guisser',
       'Ouarzazate', 'Guelmim', 'Berrchid', 'Sidi Ifni', 'Berkane',
       'Sefrou', 'Taounate', 'Khémisset', 'Essaouira', 'Boujdour'],
      dtype=object)

Cities are now clean.
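If more spelling variants show up later, the same fix can be kept in a single mapping and applied with `Series.replace`. This is only a sketch on a hypothetical sample; the mapping holds the two corrections made above.

```python
import pandas as pd

# Hypothetical sample of city values
cities = pd.Series(["Rabat Hassan", "Laayoune", "Laâyoune", "Casablanca"])

# One place to list every variant -> canonical spelling
city_fixes = {
    "Rabat Hassan": "Rabat",
    "Laayoune": "Laâyoune",
}

# Values that are not in the mapping are left untouched
clean = cities.replace(city_fixes)
print(clean.unique())
```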

In [59]:
df['bank_category'].unique()

array(['Banque', "Service de transfert d'argent",
       "Service de transfert d'argent par mandat", "Banque d'épargne",
       'Poste', 'Institution financière', 'Établissement de crédit',
       'Magasin de téléphonie mobile', 'Conseiller financier',
       'Bibliothèque municipale', 'Bureau de change',
       'Distributeur de billets', nan, 'Attraction touristique',
       'Pizzeria', 'Dépôt-vente'], dtype=object)

Some of these categories are not banks at all (for example `Pizzeria`), so we need to get rid of them. Rows with a missing category are kept.

In [60]:
df = df.loc[df['bank_category'].isin(['Banque', "Banque d'épargne", 'Institution financière', 'Établissement de crédit', np.nan])]
df['bank_category'].unique()

array(['Banque', "Banque d'épargne", 'Institution financière',
       'Établissement de crédit', nan], dtype=object)
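The filter above keeps missing categories by listing `np.nan` inside `isin`. A slightly more explicit sketch, on a hypothetical sample, separates the two conditions so that the intent to keep missing values is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of category values
categories = pd.Series(["Banque", "Pizzeria", np.nan, "Poste", "Banque d'épargne"])

bank_categories = [
    "Banque",
    "Banque d'épargne",
    "Institution financière",
    "Établissement de crédit",
]

# Keep rows whose category is bank-like OR missing
keep = categories.isin(bank_categories) | categories.isna()
filtered = categories[keep]
print(filtered)
```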

In [61]:
df['bank_title'].unique()

array(['Bank Of Africa, Agence Al Massira, Maarif.', 'Citibank',
       'Al Akhdar Bank agence casa Racine',
       'BANK OF AFRICA, Agence Abdelmoumen, Casablanca.',
       'Attijariwafa Bank', 'Bank Assafa - Agence PALMIER',
       'BANK OF AFRICA', 'BANK OF AFRICA Golf', 'Bank chaabi',
       'Bank of Africa', 'AttijariWafa Bank Blaise Pascal',
       'Bank of Africa la Gironde', 'ATTIJARIWAFA BANK Sidi Abderrahmane',
       'Attijariwafa Bank, Agence Rue De Rome, Mers Sultan.',
       'Banque Attijariwafa Bank', 'Al Barid bank',
       'BANK OF AFRICA RABAT HAY RIYAD',
       'Attijariwafa Bank Rabat Abd. AlKhattabi', 'BMCE Bank',
       'Bank Al Yousr', 'CFG Bank - Agence Rabat Souissi',
       'banque populaire', 'Barid Bank', 'Umnia Bank Rabat Al Manal',
       'AL AKHDAR BANK HAY RIAD', 'Attijariwafa bank', 'Banque Al yousr',
       'BMCE BANK Agence Rabat Centrale', 'Tijariwafa Bank',
       'Umnia Bank (ag. Rabat hay Ryad)', 'BANK OF AFRICA RABAT CENTRALE',
       'Banque Pop

Here, we see that many banks share the same name but are written in different ways, so we need to normalize them. We are going to use the fuzzywuzzy library to help us with that.  
Before that, let's remove the classes that do not refer to banks.

In [62]:
# import the function from utils/functions.py
import sys
sys.path.append('../utils')
import functions

# define a list of bank labels
labels = df['bank_title'].unique()

# create a dictionary that maps each bank label to its corresponding bank
bank_dict = functions.map_labels_to_banks(labels)
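The body of `functions.map_labels_to_banks` is not shown in this notebook. As a rough idea of what such a mapping can look like, here is a sketch that uses the standard library's `difflib` as a stand-in for fuzzywuzzy. The names `CANONICAL_BANKS`, `match_bank`, `map_labels` and the 0.8 cutoff are assumptions made for illustration, not the project's actual code.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of canonical bank names (the real list is not shown)
CANONICAL_BANKS = ["Attijariwafa Bank", "Bank Of Africa", "Banque Populaire"]


def match_bank(label, canonical=CANONICAL_BANKS, cutoff=0.8):
    """Return the closest canonical name, or 'unknown' below the cutoff."""
    tokens = re.findall(r"\w+", str(label).lower())
    best_name, best_score = "unknown", 0.0
    for name in canonical:
        name_tokens = re.findall(r"\w+", name.lower())
        # Compare only the leading words, so branch details after the
        # bank name ("Agence ...", city, district) do not lower the score
        window = " ".join(tokens[:len(name_tokens)])
        score = SequenceMatcher(None, window, " ".join(name_tokens)).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= cutoff else "unknown"


def map_labels(labels):
    """Build a {raw label: canonical bank name} dictionary."""
    return {label: match_bank(label) for label in labels}


demo = map_labels([
    "Bank Of Africa, Agence Al Massira, Maarif.",
    "Tijariwafa Bank",
    "Pizzeria Roma",
])
print(demo)
```

This sketch assumes the bank name comes first in the label; labels such as 'Banque Attijariwafa Bank' would need extra handling.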

In [63]:
# replace each raw label in bank_title with its normalized bank name
df['bank_title'] = df['bank_title'].apply(lambda x: bank_dict[x])

# drop the rows that contain unknown banks
df = df.loc[df['bank_title'] != 'unknown']

Next, we look at the other fields related to the bank.

In [64]:
df['bank_phone'].unique()

array([nan, '+212 5224-89600', '+212 5229-99900', '+212 5222-95955',
       '+212 5290-25980', '+212 5229-53570', '+212 5226-39550',
       '+212 5223-04466', '+212 5222-82285', '+212 5377-13575',
       '+212 5224-98004', '+212 5377-14519', '+212 5377-02296',
       '+212 5375-69691', '+212 5380-12910', '+212 5375-76995',
       '+212 5227-76800', '+212 5226-46264', '+212 530-855820',
       '+212 5377-78151', '+212 5372-16152', '+212 5375-75970',
       '+212 5372-31291', '+212 5359-30078', '+212 611-618255',
       '+212 5359-62480', '+212 5357-24090', '+212 5357-37840',
       '+212 5359-40183', '+212 5356-44122', '+212 5224-24243',
       '+212 80-2000047', '+212 5224-24242', '+212 5244-30320',
       '+212 5244-37265', '+212 52980-0755', '+212 5243-14693',
       '+212 5244-18845', '+212 5244-37326', '+212 5244-46025',
       '+212 5243-39180', '+212 5393-28340', '+212 5399-37424',
       '+212 5393-49120', '+212 5399-57649', '+212 5393-28280',
       '+212 5399-32282', '+212 538

In [65]:
df['bank_rank'].unique()

array([20, 19, 18, 17, 16, 14, 15, 13, 12, 11, 10,  9,  8,  7,  6,  5,  3,
        4,  2,  1])

Now, all the fields that are related to the bank are tidy.  
We need to do the same for the fields related to reviews and reviewers.  
For the review text, we only need one column: it holds the translated text (which is in French) when a translation exists, and otherwise the original text, which is then already in French.

In [66]:
# create a text column which takes text_translated if it exists, otherwise it takes text
df['review_text'] = df['review_text_translated'].fillna(df['review_text'])
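To make the behaviour of `fillna` with another column concrete, here is a small sketch on hypothetical reviews: the translation wins when it exists, the original text is used otherwise, and rows with neither stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical reviews: French original, translated English review, empty review
reviews = pd.DataFrame({
    "review_text": ["Mauvais service client", "Great staff", np.nan],
    "review_text_translated": [np.nan, "Personnel formidable", np.nan],
})

# Take the translation when present, otherwise fall back to the original
reviews["review_text"] = reviews["review_text_translated"].fillna(
    reviews["review_text"]
)
print(reviews["review_text"].tolist())
```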

In [67]:
# set missing reviewer names to 'anonymous'
df['reviewer_name'] = df['reviewer_name'].fillna('anonymous')

Now we're all set and can drop the columns that we no longer need.

In [68]:
df.drop(['review_text_translated', 'review_id', 'reviewer_id', 'reviewer_num_reviews'], axis=1, inplace=True)

In [69]:
df['review_id'] = df.index
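Because rows were filtered out earlier, the index (and therefore `review_id`) is unique but not contiguous. If contiguous IDs are preferred, one possible sketch, on a hypothetical frame, is to reset the index first.

```python
import pandas as pd

# Hypothetical frame; filtering leaves gaps in the index
df_demo = pd.DataFrame({"review_stars": [5, 1, 4, 2]})
df_demo = df_demo[df_demo["review_stars"] != 1]
gappy_ids = df_demo.index.tolist()  # labels of the surviving rows

# Rebuild a 0..n-1 index, then use it as the identifier
df_demo = df_demo.reset_index(drop=True)
df_demo["review_id"] = df_demo.index
print(gappy_ids, df_demo["review_id"].tolist())
```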

In [70]:
df.dtypes

bank_adress            object
bank_category          object
bank_id                uint64
bank_city              object
bank_latitude         float64
bank_longitude        float64
bank_phone             object
bank_rank               int64
bank_reviews_count      int64
bank_title             object
bank_score            float64
review_date            object
review_stars          float64
review_text            object
reviewer_name          object
review_id               int64
dtype: object

Let's work on the date column.

In [71]:
df.loc[df['review_date'].isna()]

Unnamed: 0,bank_adress,bank_category,bank_id,bank_city,bank_latitude,bank_longitude,bank_phone,bank_rank,bank_reviews_count,bank_title,bank_score,review_date,review_stars,review_text,reviewer_name,review_id
76,"H9QW+8XF, Rue Barathon, Casablanca 20250, Maroc",Banque,14450585638160699392,Casablanca,33.588303,-7.602589,,6,0,BMCE Group,,,,,anonymous,76
190,"face à la C.M.R, 04 Av. Al Araar, Rabat 10100,...",Banque,14464757941488660480,Rabat,33.954396,-6.874433,+212 530-855820,9,0,Al Akhdar Bank,,,,,anonymous,190
238,"X43P+CG2, Av. Annakhil, Rabat, Maroc",Banque,12282786406277588992,Rabat,33.953509,-6.863723,,3,0,Attijariwafa Bank,,,,,anonymous,238
241,"260 Ave Mohammed V, Rabat 10020, Maroc",Banque,10759407356930179072,Rabat,34.015919,-6.835043,+212 5372-31291,2,0,BMCE Group,,,,,anonymous,241
369,"Lotissement 07, Fès city Center, champ de Cour...",Banque,12776808075510450176,Fès,34.018125,-5.007845,+212 5224-24243,4,0,Société Générale,,,,,anonymous,369
377,"Angle Avenue Saint Louis et rue 48, Ain chkf, ...",Banque,283667880011130496,Fès,34.022136,-5.006958,+212 5226-46264,1,0,Umnia Bank,,,,,anonymous,377
475,"Avenue Houmane El fetouaki, Arset Lamaâch, 400...",Banque,10771914246444840960,Marrakech,31.621616,-7.988889,+212 5226-46264,10,0,Umnia Bank,,,,,anonymous,475
510,"50 Rue Ibn Aïcha, Marrakech 40000, Maroc50 زنق...",Banque,9444668425060435968,Marrakech,31.637591,-8.011332,,9,0,Attijariwafa Bank,,,,,anonymous,510
563,"M255+8J2, Av. Ibn Sina, Marrakech 40000, Maroc",Banque,5535763029298322432,Marrakech,31.658264,-7.991,,6,0,Attijariwafa Bank,,,,,anonymous,563
749,Angle Route Rgaye et Rue Echahid Ahrach Mohame...,Banque,13043963981056419840,Tanger,35.753185,-5.7997,+212 5226-46264,6,0,Umnia Bank,,,,,anonymous,749


In [72]:
# drop values with no date
df = df.loc[df['review_date'].notna()]

In [73]:
df['review_date'].isna().sum()

0

No missing values, that's good.

In [74]:
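# parse the ISO 8601 timestamps (e.g. 2023-02-27T15:53:13.594Z) and keep only the calendar date (UTC);
# the explicit format '%Y-%m-%d' does not cover the time part and can be rejected by recent pandas versions
df['review_date'] = pd.to_datetime(df['review_date'], utc=True).dt.date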
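The raw values in `review_date` are ISO 8601 strings with a trailing `Z` (UTC). The following sketch, on two timestamps taken from the preview above, shows the conversion to plain dates. Note that the date is taken in UTC, so a review posted late in the evening local time could land on a different day.

```python
import datetime as dt

import pandas as pd

# Two timestamps copied from the df.head() preview
stamps = pd.Series(["2023-02-27T15:53:13.594Z", "2022-09-28T09:38:14.263Z"])

# Parse as timezone-aware UTC datetimes, then keep only the date part
dates = pd.to_datetime(stamps, utc=True).dt.date
print(dates.tolist())
```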

In [75]:
df.head()

Unnamed: 0,bank_adress,bank_category,bank_id,bank_city,bank_latitude,bank_longitude,bank_phone,bank_rank,bank_reviews_count,bank_title,bank_score,review_date,review_stars,review_text,reviewer_name,review_id
0,"Rue Abou Abdellah Nafii, Casablanca 20250, Maroc",Banque,608978182545380096,Casablanca,33.585603,-7.633682,,20,8,BMCE Group,3.1,2023-02-27,5.0,Une très belle équipe .\nJe recommande spécial...,anonymous,0
1,"Rue Abou Abdellah Nafii, Casablanca 20250, Maroc",Banque,608978182545380096,Casablanca,33.585603,-7.633682,,20,8,BMCE Group,3.1,2022-09-28,5.0,,anonymous,1
2,"Rue Abou Abdellah Nafii, Casablanca 20250, Maroc",Banque,608978182545380096,Casablanca,33.585603,-7.633682,,20,8,BMCE Group,3.1,2022-05-27,2.0,,anonymous,2
3,"Rue Abou Abdellah Nafii, Casablanca 20250, Maroc",Banque,608978182545380096,Casablanca,33.585603,-7.633682,,20,8,BMCE Group,3.1,2022-04-08,1.0,,anonymous,3
4,"Rue Abou Abdellah Nafii, Casablanca 20250, Maroc",Banque,608978182545380096,Casablanca,33.585603,-7.633682,,20,8,BMCE Group,3.1,2022-01-08,1.0,Mauvais service client,anonymous,4


In [76]:
# reorder the columns
df = df[['bank_id', 'bank_title', 'bank_category', 'bank_adress', 'bank_city', 'bank_phone', 'bank_rank', 'bank_reviews_count', 'bank_score', 'bank_latitude', 'bank_longitude', 'review_id', 'review_date', 'review_stars', 'review_text', 'reviewer_name']]

One more step will make building the data warehouse easier: splitting this dataset into two datasets, one for banks and one for reviews (the reviewer name stays with each review).

In [77]:
banks_df = df[['bank_id', 'bank_title', 'bank_category', 'bank_adress', 'bank_city', 'bank_phone', 'bank_rank', 'bank_reviews_count', 'bank_score', 'bank_latitude', 'bank_longitude']]
reviews_df = df[['review_id', 'bank_id', 'review_date', 'review_stars', 'review_text', 'reviewer_name']]

In [78]:
banks_df = banks_df.drop_duplicates()
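After such a split, two quick checks are worth running: each bank should appear once in the banks table, and every review should point to an existing bank. The sketch below uses a hypothetical, reduced set of columns and rows.

```python
import pandas as pd

# Hypothetical flat table: one row per review, bank fields repeated
flat = pd.DataFrame({
    "bank_id": [1, 1, 2],
    "bank_title": ["BMCE Group", "BMCE Group", "Umnia Bank"],
    "review_id": [0, 1, 2],
    "review_stars": [5.0, 1.0, 4.0],
})

# Bank-level columns, one row per bank after de-duplication
banks = flat[["bank_id", "bank_title"]].drop_duplicates()
# Review-level columns, keeping bank_id as the link to the banks table
reviews = flat[["review_id", "bank_id", "review_stars"]]

# bank_id should identify exactly one row in the banks table
assert banks["bank_id"].is_unique
# every review should reference a bank that exists
assert reviews["bank_id"].isin(banks["bank_id"]).all()
print(len(banks), len(reviews))
```

If the first assertion fails on the real data, the same `bank_id` carries conflicting bank attributes and needs a closer look before loading.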

In [79]:
# create a directory for the processed data (added in .gitignore); -p avoids an error if it already exists
!mkdir -p ../data/processed


In [80]:
banks_df.to_excel("../data/processed/banks.xlsx", index=False)
reviews_df.to_excel("../data/processed/reviews.xlsx", index=False)
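One caveat to keep in mind: `bank_id` is a 20-digit `uint64`, while Excel stores numeric cells as double-precision floats, so IDs of that size may not survive a round trip through `.xlsx` exactly. A possible precaution, sketched here with a hypothetical ID, is to store the identifier as text before writing.

```python
import pandas as pd

# Hypothetical 20-digit identifier, odd on purpose
big_id = 14450585638160699393

# A double cannot represent this integer exactly
lossy = int(float(big_id))
print(big_id == lossy)

# Storing the ID as text avoids the numeric conversion altogether
banks = pd.DataFrame({"bank_id": pd.Series([big_id], dtype="uint64")})
banks["bank_id"] = banks["bank_id"].astype(str)
print(banks["bank_id"].iloc[0])
```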