# Data Cleaning
Now that we have the _raw_ scraped data, it's time to clean it up.

In [1]:
import pandas as pd

raw_data = pd.read_csv("data/avito_data.csv",index_col=0)
print(raw_data.shape)
raw_data.head(2)

(1988, 12)


Unnamed: 0,subject,description,Price,Type,Secteur,Prix / m²,Salons,Frais de syndic / mois,Étage,Âge du bien,Surface habitable,link
0,Appartement à Haut Anza,Appartement neuf bien agencé à trois façades d...,305000,"Appartements, Vente",Haut Anza,,1.0,50 DH,3,,60 m²,https://avito.ma/fr/haut_anza/appartements/App...
1,appartement à vendre Hay Salam Agadir,Appartements à vendre dans la première ville d...,480000,"Appartements, Vente",Hay Salam,,1.0,100 DH,3,,64 m²,https://avito.ma/fr/hay_salam/appartements/app...


First of all, we see a lot of NaN values (missing values) as well as duplicate rows.


To find duplicate rows in the dataset, we will use a simple DataFrame method: `duplicated()`.

In [2]:
dupli=raw_data.duplicated()
dupli

0       False
1       False
2       False
3       False
4       False
        ...  
1983    False
1984    False
1985    False
1986    False
1987    False
Length: 1988, dtype: bool

This method returns a boolean Series in which `True` marks a row that is an exact copy of an earlier row.
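As a small illustration of how `duplicated()` flags rows, here is a sketch on a toy frame (the data below is made up for the example, not taken from the scraped file). By default only the later copies are flagged; `keep=False` flags every member of a duplicate group.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "subject": ["flat A", "flat B", "flat A"],
    "Price": [305000, 480000, 305000],
})

# Default keep="first": the first occurrence is not flagged, later copies are.
first_kept = toy.duplicated()

# keep=False: every row that belongs to a duplicate group is flagged.
all_copies = toy.duplicated(keep=False)

print(first_kept.tolist())
print(all_copies.tolist())
print("rows that would be dropped:", int(first_kept.sum()))
```

Summing the boolean Series gives the number of rows that `drop_duplicates()` would remove with its default settings.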

In [3]:
print(raw_data[dupli].shape)
raw_data[dupli].head(2)


(196, 12)


Unnamed: 0,subject,description,Price,Type,Secteur,Prix / m²,Salons,Frais de syndic / mois,Étage,Âge du bien,Surface habitable,link
35,Appartement à Haut Anza,Appartement neuf bien agencé à trois façades d...,305000,"Appartements, Vente",Haut Anza,,1.0,50 DH,3,,60 m²,https://avito.ma/fr/haut_anza/appartements/App...
36,appartement à vendre Hay Salam Agadir,Appartements à vendre dans la première ville d...,480000,"Appartements, Vente",Hay Salam,,1.0,100 DH,3,,64 m²,https://avito.ma/fr/hay_salam/appartements/app...


There are 196 duplicates that we need to remove. The method `.drop_duplicates()` does exactly what we want.

In [4]:
db_data=raw_data.drop_duplicates()

In [5]:
print(raw_data.shape,db_data.shape)

(1988, 12) (1792, 12)
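`drop_duplicates()` with no arguments only removes rows that match on every column. A possible variant, sketched below on made-up data, deduplicates on a single key column and resets the index afterwards. Treating `link` as a unique identifier for a listing is an assumption that has not been checked against this dataset.

```python
import pandas as pd

# Hypothetical toy data: the same listing URL appears twice with different titles.
toy = pd.DataFrame({
    "subject": ["flat A", "flat A (updated)", "flat B"],
    "link": [
        "https://example.com/a",
        "https://example.com/a",
        "https://example.com/b",
    ],
})

# Exact-row deduplication keeps all three rows, because the subjects differ.
exact = toy.drop_duplicates()

# Deduplicating on the assumed key column keeps one row per link.
# reset_index(drop=True) renumbers the rows from 0 after the removal.
by_link = toy.drop_duplicates(subset="link", keep="first").reset_index(drop=True)

print(exact.shape, by_link.shape)
```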


Now let's get some info on each column of the dataset.

In [6]:
db_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1792 entries, 0 to 1987
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   subject                 1792 non-null   object 
 1   description             1792 non-null   object 
 2   Price                   1792 non-null   int64  
 3   Type                    1792 non-null   object 
 4   Secteur                 1792 non-null   object 
 5   Prix / m²               2 non-null      object 
 6   Salons                  1445 non-null   float64
 7   Frais de syndic / mois  647 non-null    object 
 8   Étage                   1572 non-null   object 
 9   Âge du bien             968 non-null    object 
 10  Surface habitable       1038 non-null   object 
 11  link                    1792 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 182.0+ KB


As you can see, the **Prix / m²** column has only 2 non-null values, so it is safe to drop it altogether since it is not very useful.

In [7]:
db_data=db_data.drop(columns=['Prix / m²'])
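Instead of reading the counts off `info()` by eye, the share of non-null values per column can be computed directly. The sketch below uses a made-up frame, and the 50% cutoff is a hypothetical threshold chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one very sparse column.
toy = pd.DataFrame({
    "Price": [305000, 480000, 520000, 610000],
    "Prix / m²": [np.nan, np.nan, np.nan, "5000 DH"],
    "Surface habitable": ["60 m²", "64 m²", np.nan, np.nan],
})

# Fraction of non-null values in each column (1.0 means no missing values).
non_null_share = toy.notna().mean()

# Assumed cutoff: drop columns that are filled in less than half of the rows.
min_share = 0.5
sparse_cols = non_null_share[non_null_share < min_share].index.tolist()

cleaned = toy.drop(columns=sparse_cols)
print(sparse_cols)
print(cleaned.columns.tolist())
```

A cutoff like this would also have to spare columns such as **Surface habitable** that are sparse but important, so it is better used as a screening aid than as an automatic rule.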

The **Surface habitable** column has a lot of null values. This column is important to our analysis, so we can't drop it; we have to deal with these missing values.
Luckily, the **subject** and **description** columns may contain the missing values.

## Dealing with NaN

In [8]:
db_data.head(2)

Unnamed: 0,subject,description,Price,Type,Secteur,Salons,Frais de syndic / mois,Étage,Âge du bien,Surface habitable,link
0,Appartement à Haut Anza,Appartement neuf bien agencé à trois façades d...,305000,"Appartements, Vente",Haut Anza,1.0,50 DH,3,,60 m²,https://avito.ma/fr/haut_anza/appartements/App...
1,appartement à vendre Hay Salam Agadir,Appartements à vendre dans la première ville d...,480000,"Appartements, Vente",Hay Salam,1.0,100 DH,3,,64 m²,https://avito.ma/fr/hay_salam/appartements/app...


In [68]:
df=db_data[db_data["Surface habitable"].isna()][['subject','description',"Surface habitable",'Price','link']]

Surface formats seen in the listing text: 60m2, 109 mètre, 88m², 68 m2, 60م, 85 متر, 85متر

In [69]:
i=49



print(df['description'].iloc[i],'\n\n',df['Price'].iloc[i],'\n\n',df['subject'].iloc[i])

شقق عائلية للبيع مفروشة ب (اقامة تزرزيت ) الحي المحمادي
شقة جيدة للبيع ب الحي المحمدي اكادير ، مساحتها الإجمالية 54متر، تتواجد بالطابق الثالث ، تتوفر على 2 غرف ، صالون، حمام ، ومطبخ
تمن البيع 480000
للمزيد من المعلومات المرجو الاتصال بنا
 

 480000 

 شقة  للبيع مفروشة ب اقامة تزرزيت الحي المحمادي


In [201]:
text=df['description'].iloc[2]
text

'Affaire urgent reste deux appartements titres à vendre au village de Tamraght :\r\n   -109 mètre : 520.000 dh \r\n- 2 chambres , salon , cuisine équipée , deux salles de bain .\r\n*Honoraires de l’agence :\r\n- En cas de vente : 2.5 % HT du montant de la vente.\r\n*La signature d’un bon de visite est obligatoire.\r\nPour Infos veillez contacter\r\n'

In [165]:
import re
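One way the surface could be pulled out of the free text with `re` is sketched below. The pattern and the helper name `extract_surface` are illustrative assumptions, not the notebook's final approach. The pattern only covers the unit spellings noted earlier (m2, m², mètre, متر, م) and takes the first 2 to 4 digit number that is directly followed by one of them.

```python
import re
from typing import Optional

# Assumed pattern: a 2-4 digit number, optional whitespace, then a known unit.
# (?<!\d) avoids starting inside a longer number; (?!\w) requires the unit to
# end at a word boundary, so a bare "م" is not matched inside a longer word.
SURFACE_RE = re.compile(
    r"(?<!\d)(\d{2,4})\s*(?:m²|m2|m[eè]tres?|متر|م)(?!\w)",
    flags=re.IGNORECASE,
)


def extract_surface(text: str) -> Optional[int]:
    """Return the first surface value found in text, or None if there is none."""
    match = SURFACE_RE.search(text)
    return int(match.group(1)) if match else None


samples = [
    "Appartement 60m2 bien situé",
    "-109 mètre : 520.000 dh",
    "مساحتها الإجمالية 54متر، تتواجد بالطابق الثالث",
    "2 chambres, salon, cuisine équipée",
]
for sample in samples:
    print(extract_surface(sample))
```

A number followed by a unit is not guaranteed to be the living area (it could be a terrace or a plot), so values extracted this way would still need a sanity check before filling **Surface habitable**.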



## Column data types

The **Surface habitable** column represents the surface area of the apartment. Let's rename it, trim the unit (m²), then convert it to int.

In [95]:
db_data=db_data.rename(columns={"Surface habitable": "Surface habitable (m²)"})

In [96]:
db_data["Surface habitable (m²)"]=db_data["Surface habitable (m²)"].str.strip(" m²")


In [97]:
db_data["Surface habitable (m²)"]=db_data["Surface habitable (m²)"].astype("int")
db_data.head()

ValueError: cannot convert float NaN to integer
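The conversion fails because the column still contains NaN, and NumPy's plain `int` dtype has no way to represent a missing value. A minimal sketch of one workaround, on a made-up Series, is to parse the strings with `pd.to_numeric` and cast to pandas' nullable `Int64` dtype, which stores missing entries as `<NA>`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing surface.
toy = pd.Series(["60 m²", "64 m²", np.nan], name="Surface habitable (m²)")

# Same trimming step as above: remove spaces, "m" and "²" from both ends.
stripped = toy.str.strip(" m²")

# errors="coerce" turns anything unparseable into NaN instead of raising;
# "Int64" (capital I) is the nullable integer dtype that tolerates missing values.
as_int = pd.to_numeric(stripped, errors="coerce").astype("Int64")

print(as_int.dtype)
print(as_int.tolist())
```

The alternative is to fill or drop the missing surfaces first and only then call `.astype("int")`.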