# Dataset Preparation
This notebook aims to create 2 separated datasets. 

1. **Prepared_Dataset.csv**, which consists of the books with their metadata including the description texts, for general use. 

2.  **Descriptions_Dataset.csv**, which only consists of the books with their description texts, to NLP-based projects.

You will find them in [Data/Books_Data](https://github.com/iHaifaa/Arabic_Books_Recommendation_System/tree/main/Data/Books_Data) folder.

In [1]:
# Required pakages
import pandas as pd
import numpy as np
import os
import glob # To browse the floders

## 1. Loading the dataset/s

In [8]:
path = 'Data/Books_Data'

all_files = glob.glob(path + "/Books_Category*.csv")   
#print(len(all_files))
li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0) # dtype={'Average_Ratings': 'category'} to rid off out of memory
    li.append(df)

full_dataset = pd.concat(li, axis=0, ignore_index=True)

In [9]:
full_dataset.shape

(17869, 19)

In [10]:
full_dataset.head()

Unnamed: 0.2,Unnamed: 0,ISBN,Title,Author,Authors_Number,Description,Genres,Average_Ratings,Reviews_Number,Quotes_Number,Community_Size,Pages_Number,Editions,Publication_Year,Publisher,URL,Cover_URL,Unnamed: 0.1,Readers_Number
0,0,ISBN 9774416333,استمتع بحياتك,تأليف محمد عبد الرحمن العريفي (تأليف),المؤلفون\n1,لما كنت فى السادسة عشرة من عمري وقع فى يدي كتا...,علوم إسلامية رقائق,4.0,مراجعات\n97,اقتباسات\n39,القرّاء\n7945,344 صفحة,طبعات\n4,نشر سنة 2009,,https://www.abjjad.com/book/15445916/%D8%A7%D8...,https://abjjadst.blob.core.windows.net/pub/2f2...,,
1,1,ISBN 13 9789777195522,عبقرية عمر,تأليف عباس محمود العقاد (تأليف),المؤلفون\n1,يزخر التاريخ الإسلامي برجال عِظام سطروا حوادثه...,علوم إسلامية سيرة الصحابة,4.1,مراجعات\n64,اقتباسات\n114,القرّاء\n7337,497 صفحة,طبعات\n3,نشر سنة 2014,مؤسسة هنداوي للتعليم والثقافة,https://www.abjjad.com/book/15445340/%D8%B9%D8...,https://abjjadst.blob.core.windows.net/pub/e5f...,,
2,2,ISBN 13 9789777195331,عبقرية محمد,تأليف عباس محمود العقاد (تأليف),المؤلفون\n1,احتفى التاريخ العربي بالسيرة المُحمدية؛ فأفرد ...,علوم إسلامية السيرة النبوية,4.3,مراجعات\n45,اقتباسات\n62,القرّاء\n4151,238 صفحة,طبعات\n2,نشر سنة 2013,مؤسسة هنداوي للتعليم والثقافة,https://www.abjjad.com/book/15445339/%D8%B9%D8...,https://abjjadst.blob.core.windows.net/pub/7df...,,
3,3,دار الحضارة للنشر والتوزيع,لأنك الله : رحلة إلى السماء السابعة,تأليف علي بن جابر الفيفي (تأليف) علي بن جابر ا...,المؤلفون\n2,كتاب يتحدث عن بعض أسماء الله الحسنى وكيف نعيشه...,علوم إسلامية رقائق,4.7,مراجعات\n83,اقتباسات\n99,القرّاء\n3133,192 صفحة,طبعات\n1,نشر سنة 2016,,https://www.abjjad.com/book/2653683959/%D9%84%...,https://abjjadst.blob.core.windows.net/pub/c35...,,
4,4,ISBN 13 9789777194693,عبقرية الإمام علي,تأليف عباس محمود العقاد (تأليف),المؤلفون\n1,بَرَع «عباس محمود العقاد» في تناول شخصية الإما...,علوم إسلامية سيرة الصحابة,4.3,مراجعات\n30,اقتباسات\n25,القرّاء\n2592,87 صفحة,طبعات\n4,نشر سنة 2013,مؤسسة هنداوي للتعليم والثقافة,https://www.abjjad.com/book/15445343/%D8%B9%D8...,https://abjjadst.blob.core.windows.net/pub/0b2...,,


## 2. Drop unwanted columns and rows

In [11]:
full_dataset = full_dataset.drop(columns=['Unnamed: 0', 'Unnamed: 0.1', 'Readers_Number'])

In [12]:
# Remove empties
full_dataset = full_dataset.dropna(axis=1, how='all') #drop any column filled only with NaN 
full_dataset = full_dataset.dropna(axis=0, how='all') #drop any rows filled only with NaN 

# Remove duplicated observations 
full_dataset = full_dataset.drop_duplicates() # was 10242 rows - became 9815 rows

In [13]:
full_dataset.shape

(16558, 16)

## 3. Correcting the wrong data
Some data was mistakely observed to be in wrong collumn, so I fixed the wrong data by place them in their correspondig columns and fill the empties with None.

In [14]:
# Some ISBN data have publisher information
# I preserved the data from the ISBN column if it does not contain word 'ISBN'. In other words, they are publisher information
publisher_info = full_dataset[full_dataset["ISBN"].str.contains('ISBN')==False]['ISBN']
# I assigned the preserved data to be filled in Publisher column
full_dataset.loc[full_dataset['ISBN'].str.contains('ISBN')==False, 'Publisher'] = publisher_info
# Just display the Publisher column after modifying 
full_dataset[full_dataset["ISBN"].str.contains('ISBN')==False]['Publisher']
# Finally, we will fill the corresponding ISBN with None
full_dataset.loc[full_dataset['ISBN'].str.contains('ISBN')==False, 'ISBN'] = None

In [15]:
# Some Publication_Year data have ISBN information
# I preserved the data from the Publication_Year column if it contain word 'ISBN'. In other words, they are ISBN information
isbn_info = full_dataset[full_dataset["Publication_Year"].str.contains('ISBN')==True]['Publication_Year']
# I assigned the preserved data to be filled in ISBN column
full_dataset.loc[full_dataset['Publication_Year'].str.contains('ISBN')==True, 'ISBN'] = isbn_info
# Just display the ISBN column after modifying 
full_dataset[full_dataset["Publication_Year"].str.contains('ISBN')==True]['ISBN']
# Finally, we will fill the corresponding Publication_Year with None
full_dataset.loc[full_dataset['Publication_Year'].str.contains('ISBN')==True, 'Publication_Year'] = None

In [16]:
# Some Publication_Year data have publisher information
# I preserved the data from the Publication_Year column if it does not contain word 'نشر سنة'. In other words, they are publisher information
publisher_info2 = full_dataset[full_dataset["Publication_Year"].str.contains('نشر سنة')==False]['Publication_Year']
# I assigned the preserved data to be filled in Publisher column
full_dataset.loc[full_dataset['Publication_Year'].str.contains('Publication_Year')==False, 'Publisher'] = publisher_info2
# Just display the Publisher column after modifying 
full_dataset[full_dataset["Publication_Year"].str.contains('نشر سنة')==False]['Publisher']
# Finally, we will fill the corresponding ISBN with None
full_dataset.loc[full_dataset['Publication_Year'].str.contains('نشر سنة')==False, 'Publication_Year'] = None

In [17]:
# Some Pages_Number data have publisher information
# I preserved the data from the Pages_Number column if it does not contain word 'صفحة'. In other words, they are publisher information
publisher_info3 = full_dataset[full_dataset["Pages_Number"].str.contains('صفحة')==False]['Pages_Number']
# I assigned the preserved data to be filled in Publisher column
full_dataset.loc[full_dataset['Pages_Number'].str.contains('صفحة')==False, 'Publisher'] = publisher_info3
# Just display the Publisher column after modifying 
full_dataset[full_dataset["Pages_Number"].str.contains('صفحة')==False]['Publisher']
# Finally, we will fill the corresponding ISBN with None
full_dataset.loc[full_dataset['Pages_Number'].str.contains('صفحة')==False, 'Pages_Number'] = None

In [18]:
# To check if we still have wrong data in the wrong column
# full_dataset.sort_values(by=['Publication_Year'], ascending=False)

## 4. Remove meaningless data or words 

In [19]:
# Keep only the first author, since the second one might be a translator or co-author but appear as main author
full_dataset['Author'] = full_dataset['Author'].str.split('\(تأليف\)').str[0].str.strip()

In [20]:
full_dataset.head()

Unnamed: 0,ISBN,Title,Author,Authors_Number,Description,Genres,Average_Ratings,Reviews_Number,Quotes_Number,Community_Size,Pages_Number,Editions,Publication_Year,Publisher,URL,Cover_URL
0,ISBN 9774416333,استمتع بحياتك,تأليف محمد عبد الرحمن العريفي,المؤلفون\n1,لما كنت فى السادسة عشرة من عمري وقع فى يدي كتا...,علوم إسلامية رقائق,4.0,مراجعات\n97,اقتباسات\n39,القرّاء\n7945,344 صفحة,طبعات\n4,نشر سنة 2009,,https://www.abjjad.com/book/15445916/%D8%A7%D8...,https://abjjadst.blob.core.windows.net/pub/2f2...
1,ISBN 13 9789777195522,عبقرية عمر,تأليف عباس محمود العقاد,المؤلفون\n1,يزخر التاريخ الإسلامي برجال عِظام سطروا حوادثه...,علوم إسلامية سيرة الصحابة,4.1,مراجعات\n64,اقتباسات\n114,القرّاء\n7337,497 صفحة,طبعات\n3,نشر سنة 2014,,https://www.abjjad.com/book/15445340/%D8%B9%D8...,https://abjjadst.blob.core.windows.net/pub/e5f...
2,ISBN 13 9789777195331,عبقرية محمد,تأليف عباس محمود العقاد,المؤلفون\n1,احتفى التاريخ العربي بالسيرة المُحمدية؛ فأفرد ...,علوم إسلامية السيرة النبوية,4.3,مراجعات\n45,اقتباسات\n62,القرّاء\n4151,238 صفحة,طبعات\n2,نشر سنة 2013,,https://www.abjjad.com/book/15445339/%D8%B9%D8...,https://abjjadst.blob.core.windows.net/pub/7df...
3,,لأنك الله : رحلة إلى السماء السابعة,تأليف علي بن جابر الفيفي,المؤلفون\n2,كتاب يتحدث عن بعض أسماء الله الحسنى وكيف نعيشه...,علوم إسلامية رقائق,4.7,مراجعات\n83,اقتباسات\n99,القرّاء\n3133,192 صفحة,طبعات\n1,نشر سنة 2016,,https://www.abjjad.com/book/2653683959/%D9%84%...,https://abjjadst.blob.core.windows.net/pub/c35...
4,ISBN 13 9789777194693,عبقرية الإمام علي,تأليف عباس محمود العقاد,المؤلفون\n1,بَرَع «عباس محمود العقاد» في تناول شخصية الإما...,علوم إسلامية سيرة الصحابة,4.3,مراجعات\n30,اقتباسات\n25,القرّاء\n2592,87 صفحة,طبعات\n4,نشر سنة 2013,,https://www.abjjad.com/book/15445343/%D8%B9%D8...,https://abjjadst.blob.core.windows.net/pub/0b2...


In [21]:
full_dataset['ISBN'] = full_dataset['ISBN'].str.replace('ISBN 13', '')
full_dataset['ISBN'] = full_dataset['ISBN'].str.replace('ISBN', '')

In [22]:
#see: https://stackoverflow.com/questions/20894525/how-to-remove-parentheses-and-all-data-within-using-pandas-python
full_dataset['Author'] = full_dataset['Author'].str.replace('\s*\([^()]*\)', '') 
full_dataset['Author'] = full_dataset['Author'].str.replace('تأليف', '')
#full_dataset.loc[full_dataset['Author'].str.contains('Author')==True, 'تأليف']
#full_dataset.loc[full_dataset['Author'] == 'مشاركة']
#full_dataset[full_dataset.Authors_Number == 'المؤلفون\n3']

In [23]:
full_dataset['Authors_Number'] = full_dataset['Authors_Number'].str.replace('المؤلفون\n', '')
full_dataset['Authors_Number'] = full_dataset['Authors_Number'].str.replace('المؤلفون', '')

In [24]:
full_dataset['Reviews_Number'] = full_dataset['Reviews_Number'].str.replace('مراجعات\n', '')
full_dataset['Reviews_Number'] = full_dataset['Reviews_Number'].str.replace('مراجعات', '')

In [25]:
full_dataset['Quotes_Number'] = full_dataset['Quotes_Number'].str.replace('اقتباسات\n', '')
full_dataset['Quotes_Number'] = full_dataset['Quotes_Number'].str.replace('اقتباسات', '')

In [26]:
full_dataset['Community_Size'] = full_dataset['Community_Size'].str.replace('القرّاء\n', '')
full_dataset['Community_Size'] = full_dataset['Community_Size'].str.replace('القرّاء', '')

In [27]:
full_dataset['Pages_Number'] = full_dataset['Pages_Number'].str.replace('صفحة', '')

In [28]:
full_dataset['Editions'] = full_dataset['Editions'].str.replace('طبعات\n', '')
full_dataset['Editions'] = full_dataset['Editions'].str.replace('طبعات', '')

In [29]:
full_dataset['Publication_Year'] = full_dataset['Publication_Year'].str.replace('نشر سنة', '')

In [30]:
full_dataset['URL'] = full_dataset['URL'].str.replace('\/reviews', '')

In [31]:
full_dataset.head()

Unnamed: 0,ISBN,Title,Author,Authors_Number,Description,Genres,Average_Ratings,Reviews_Number,Quotes_Number,Community_Size,Pages_Number,Editions,Publication_Year,Publisher,URL,Cover_URL
0,9774416333.0,استمتع بحياتك,محمد عبد الرحمن العريفي,1,لما كنت فى السادسة عشرة من عمري وقع فى يدي كتا...,علوم إسلامية رقائق,4.0,97,39,7945,344,4,2009,,https://www.abjjad.com/book/15445916/%D8%A7%D8...,https://abjjadst.blob.core.windows.net/pub/2f2...
1,9789777195522.0,عبقرية عمر,عباس محمود العقاد,1,يزخر التاريخ الإسلامي برجال عِظام سطروا حوادثه...,علوم إسلامية سيرة الصحابة,4.1,64,114,7337,497,3,2014,,https://www.abjjad.com/book/15445340/%D8%B9%D8...,https://abjjadst.blob.core.windows.net/pub/e5f...
2,9789777195331.0,عبقرية محمد,عباس محمود العقاد,1,احتفى التاريخ العربي بالسيرة المُحمدية؛ فأفرد ...,علوم إسلامية السيرة النبوية,4.3,45,62,4151,238,2,2013,,https://www.abjjad.com/book/15445339/%D8%B9%D8...,https://abjjadst.blob.core.windows.net/pub/7df...
3,,لأنك الله : رحلة إلى السماء السابعة,علي بن جابر الفيفي,2,كتاب يتحدث عن بعض أسماء الله الحسنى وكيف نعيشه...,علوم إسلامية رقائق,4.7,83,99,3133,192,1,2016,,https://www.abjjad.com/book/2653683959/%D9%84%...,https://abjjadst.blob.core.windows.net/pub/c35...
4,9789777194693.0,عبقرية الإمام علي,عباس محمود العقاد,1,بَرَع «عباس محمود العقاد» في تناول شخصية الإما...,علوم إسلامية سيرة الصحابة,4.3,30,25,2592,87,4,2013,,https://www.abjjad.com/book/15445343/%D8%B9%D8...,https://abjjadst.blob.core.windows.net/pub/0b2...


## 5. Convert data type

In [32]:
full_dataset.update(full_dataset[['Average_Ratings','Reviews_Number','Quotes_Number', 'Community_Size', 'Pages_Number']].fillna(0))
full_dataset.update(full_dataset[['Authors_Number','Editions']].fillna(1))
#full_dataset['Average_Ratings'] = full_dataset['Average_Ratings'].fillna(0)
#full_dataset['Average_Ratings'] = full_dataset['Average_Ratings'].replace(np.nan, 0)

In [33]:
# Convert data types
full_dataset['Authors_Number'] = pd.to_numeric(full_dataset['Authors_Number']) # OR: full_dataset['Authors_Number'].apply(pd.to_numeric)
full_dataset['Reviews_Number'] = pd.to_numeric(full_dataset['Reviews_Number'])
full_dataset['Quotes_Number'] = pd.to_numeric(full_dataset['Quotes_Number'])
full_dataset['Community_Size'] = pd.to_numeric(full_dataset['Community_Size'])
full_dataset['Pages_Number'] = pd.to_numeric(full_dataset['Pages_Number'])
full_dataset['Editions'] = pd.to_numeric(full_dataset['Editions'])
full_dataset['Publication_Year'] = pd.to_numeric(full_dataset['Publication_Year'])

In [34]:
# Convert them into integers instead of float
'''full_dataset = full_dataset.astype({"Authors_Number": int, 
                                    "Reviews_Number": int,
                                    "Quotes_Number": int,
                                    "Community_Size": int,
                                    "Publication_Year": int})'''

'full_dataset = full_dataset.astype({"Authors_Number": int, \n                                    "Reviews_Number": int,\n                                    "Quotes_Number": int,\n                                    "Community_Size": int,\n                                    "Publication_Year": int})'

In [35]:
full_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16558 entries, 0 to 17868
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ISBN              8562 non-null   object 
 1   Title             16558 non-null  object 
 2   Author            16557 non-null  object 
 3   Authors_Number    16551 non-null  float64
 4   Description       16558 non-null  object 
 5   Genres            15178 non-null  object 
 6   Average_Ratings   16558 non-null  object 
 7   Reviews_Number    7627 non-null   float64
 8   Quotes_Number     4555 non-null   float64
 9   Community_Size    15777 non-null  float64
 10  Pages_Number      16558 non-null  int64  
 11  Editions          16558 non-null  int64  
 12  Publication_Year  14154 non-null  float64
 13  Publisher         1461 non-null   object 
 14  URL               16558 non-null  object 
 15  Cover_URL         16554 non-null  object 
dtypes: float64(5), int64(2), object(9)
memor

In [36]:
full_dataset.to_csv('Data/Books_Data/Prepared_Dataset.csv') # To publish on Kaggle as Arabic books dataset

In [37]:
final_dataset = pd.read_csv('Data/Books_Data/Prepared_Dataset.csv', skipinitialspace=True) 

## 6. Prepare the data for this project

### 6.1 Remove unwanted columns

In [38]:
final_dataset = final_dataset.drop(['Unnamed: 0', 'ISBN', 'Authors_Number', 'Average_Ratings', 'Reviews_Number', 'Quotes_Number', 'Community_Size', 'Pages_Number', 'Editions', 'Publication_Year', 'Publisher'], axis = 1)

### 6.2 Remove books with non Arabic description

In [39]:
final_dataset.sort_values(by=['Description'], ascending=[0])

Unnamed: 0,Title,Author,Description,Genres,URL,Cover_URL
1937,العلاقة بين الجن والإنس,إبراهيم كمال أدهم,👿 كتاب العلاقة بين الجن والإنس، تأليف: إبراهيم...,غرائب وأساطير سحر,https://www.abjjad.com/book/2660139008/%D8%A7%...,https://abjjadst.blob.core.windows.net/pub/ef0...
12657,رحلة الشيخ علي الليثي ببلاد النمسا وألمانيا,علي حسن الليثي,ﻳﺸﻜﻞ أدب اﻟﺮﺣﻼت، ﻣﺎدة ﻏﻨﯿﺔ وﻣﻤﺘﻌﺔ ﻓﻲ اﻟﺘﺮاث اﻟ...,سفر ورحلات أدب رحلات,https://www.abjjad.com/book/2177305009/%D8%B1%...,https://abjjadst.blob.core.windows.net/pub/650...
1483,معجم البيولوجيا انجليزى - فرنسى - عربى,هلا فلاح الخنساء,ﻳﺤﺘﻮي ھﺬا اﻟﻤﻌﺠﻢ ﻋﻠﻰ أﻛﺜﺮ ﻣﻦ 5600 ﻣﺪﺧﻞ ﻣﻌﺮف ﻓﻲ...,مراجع معاجم,https://www.abjjad.com/book/2178909847/%D9%85%...,https://abjjadst.blob.core.windows.net/pub/8c4...
1564,معجم الجيولوجيا انجليزى - فرنسى - عربى,حسين عبد المحسن حسين,ﻳﺤﺘﻮي ھﺬا اﻟﻤﻌﺠﻢ ﻋﻠﻰ أﻛﺜﺮ ﻣﻦ 2000 ﻣﺼﻄﻠﺢ ﻣﻌﺮف ﻓ...,,https://www.abjjad.com/book/2178877181/%D9%85%...,https://abjjadst.blob.core.windows.net/pub/993...
6314,الطريقة السهلة للإقلاع عن التدخين,ألان كار,ﻳﺘﻨـﺎول أﻻن ﻛﺎر ﻳﺘﻨـﺎول أﻻن ﻛﺎر، اﻟﺨﺒﻴﺮ ذاﺋﻊ ا...,صحة وطب علاج,https://www.abjjad.com/book/2206794146/%D8%A7%...,https://abjjadst.blob.core.windows.net/pub/ef1...
...,...,...,...,...,...,...
16524,القرار الإداري السلبي - دراسة مقارنة بالفقه ال...,شعبان عبد الحكيم سلامة,,قانون قانون إداري,https://www.abjjad.com/book/2486305117/%D8%A7%...,https://abjjadst.blob.core.windows.net/pub/86a...
16525,الرقابة القضائية على العقود الإدارية في مرحلتي...,محمد بن سعيد بن حمد المعمري,,قانون قانون إداري,https://www.abjjad.com/book/2486468610/%D8%A7%...,https://abjjadst.blob.core.windows.net/pub/11a...
16526,إنقضاء الخصومة الإدارية بالإرادة المنفردة للخص...,محمد باهي أبو يونس,,قانون قانون إداري,https://www.abjjad.com/book/2486501423/%D8%A7%...,https://abjjadst.blob.core.windows.net/pub/bf8...
16531,الإجراءات الجنائية قي النظم القانونية العربية ...,محمود شريف بسيوني,,قانون قانون جنائي,https://www.abjjad.com/book/15446889/%D8%A7%D9...,https://abjjadst.blob.core.windows.net/pub/559...


In [40]:
# Keep only the rows that contain description
final_dataset = final_dataset[final_dataset['Description'].notna()] 

In [41]:
final_dataset.shape

(14200, 6)

In [43]:
#detect(final_dataset['Description'][3]) != 'ar'

In [44]:
from langdetect import detect

#https://stackoverflow.com/questions/60930935/exclude-non-english-rows-in-pandas

def is_ar(txt):
    try:
        return detect(txt)=='ar'
    except:
        return False

full_dataset = full_dataset[full_dataset['Description'].apply(is_ar)]

In [45]:
full_dataset.shape

(14160, 16)

In [82]:
'''for index, i in enumerate(full_dataset['Description']):
    print(index)
    try:
        if detect(full_dataset['Description'][index]) != 'ar':
            print("Book number: ", index, "was dropped due to using another language")
            full_dataset.drop(index)
    except:
            print("Book number: ", index, "was dropped due to an exception")
            full_dataset.drop(index)
            continue'''

'for index, i in enumerate(full_dataset[\'Description\']):\n    print(index)\n    try:\n        if detect(full_dataset[\'Description\'][index]) != \'ar\':\n            print("Book number: ", index, "was dropped due to using another language")\n            full_dataset.drop(index)\n    except:\n            print("Book number: ", index, "was dropped due to an exception")\n            full_dataset.drop(index)\n            continue'

In [46]:
final_dataset.to_csv('Data/Descriptions_Dataset.csv') # To be used for recommendation system