### Importing libraries & dataset

In [137]:
import pandas as pd

In [138]:
df = pd.read_csv('../Data/Raw/fakeReviewData.csv')

___

### Initial exploration

#### 1. How big is the data?

In [139]:
df.shape

(40432, 4)

#### 2. What does the data look like?

In [140]:
df.head()

| | category | rating | label | text_ |
|---|---|---|---|---|
| 0 | Home_and_Kitchen_5 | 5.0 | CG | Love this! Well made, sturdy, and very comfor... |
| 1 | Home_and_Kitchen_5 | 5.0 | CG | love it, a great upgrade from the original. I... |
| 2 | Home_and_Kitchen_5 | 5.0 | CG | This pillow saved my back. I love the look and... |
| 3 | Home_and_Kitchen_5 | 1.0 | CG | Missing information on how to use it, but it i... |
| 4 | Home_and_Kitchen_5 | 5.0 | CG | Very nice set. Good quality. We have had the s... |


>We have 4 columns in our dataset:
>
>1) category : The product category the item belongs to
>2) rating : The rating the user gave to the product
>3) label : Our target column, showing whether the review is CG (computer generated) or OR (original)
>4) text_ : The actual text the customer wrote about the product
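Since `label` is the target column, it can be worth confirming that it contains only the two documented values and seeing how the rows split between them. The following is a minimal sketch on a small made-up frame (the rows and the `Books_5` category are illustrative, not taken from the dataset); on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the review data; these rows are illustrative only.
sample = pd.DataFrame({
    "category": ["Home_and_Kitchen_5", "Home_and_Kitchen_5", "Books_5", "Books_5"],
    "rating": [5.0, 1.0, 4.0, 5.0],
    "label": ["CG", "OR", "CG", "OR"],
    "text_": ["Love this!", "Missing information", "Great read", "Very nice set"],
})

# Count how many rows fall under each label value.
label_counts = sample["label"].value_counts()

# Collect any label outside the two documented values (CG and OR).
unexpected = set(sample["label"].unique()) - {"CG", "OR"}

print(label_counts)
print("unexpected labels:", unexpected)
```

An empty `unexpected` set means the column can be treated as a clean binary target.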

#### 3. What are the data types of the columns?

In [141]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40432 entries, 0 to 40431
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   category  40432 non-null  object 
 1   rating    40432 non-null  float64
 2   label     40432 non-null  object 
 3   text_     40432 non-null  object 
dtypes: float64(1), object(3)
memory usage: 1.2+ MB


In [142]:
df['rating'].unique()

array([5., 1., 3., 2., 4.])

>Since every rating is a whole number (there are no values like 4.5 or 3.9), we can safely convert the column to an integer type
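Reading the output of `unique()` works for five distinct values, but the same check can be done programmatically, which helps when a column has many distinct values. This is a small sketch on a stand-alone Series holding the five values seen above; the variable names are only for illustration.

```python
import pandas as pd

# The five distinct rating values reported by df['rating'].unique().
ratings = pd.Series([5.0, 1.0, 3.0, 2.0, 4.0], name="rating")

# True only if no value has a fractional part.
# NaN values would also fail this check, because NaN % 1 == 0 is False.
all_whole = bool((ratings % 1 == 0).all())

# Convert only when the check passes, so no information is lost.
if all_whole:
    ratings = ratings.astype("int64")

print(all_whole, ratings.dtype)
```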

In [143]:
before = df.memory_usage(deep=True)
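# Request int32 explicitly: plain int resolves to int32 or int64 depending on platform and NumPy version
df['rating'] = df['rating'].astype('int32')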
after = df.memory_usage(deep=True)
print(f"we saved this many bytes of memory by conversion \n{before-after}")

we saved this many bytes of memory by conversion 
Index            0
category         0
rating      161728
label            0
text_            0
dtype: int64


>Simply converting to a narrower datatype saved 161,728 bytes (4 bytes per row). Even though this is a very small amount, it is good practice, since the same step matters much more when working with large datasets
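The same idea can be pushed further. Ratings from 1 to 5 fit in `int8` (range -128 to 127), and a column with only a few distinct strings, such as `label`, can be stored with pandas' `category` dtype. The sketch below uses a made-up frame of 1,000 rows to compare memory before and after; it is an optional variation and not what this notebook does.

```python
import pandas as pd

# Illustrative frame: 1,000 rows with the same value ranges as the real columns.
sample = pd.DataFrame({
    "rating": [5.0, 1.0, 3.0, 2.0, 4.0] * 200,
    "label": ["CG", "OR", "CG", "OR", "CG"] * 200,
})

before = sample.memory_usage(deep=True).sum()

compact = sample.copy()
# 1 byte per value instead of 8.
compact["rating"] = compact["rating"].astype("int8")
# Stores each distinct string once plus a small integer code per row.
compact["label"] = compact["label"].astype("category")

after = compact.memory_usage(deep=True).sum()

print(f"bytes before: {before}, bytes after: {after}")
```

Note that a `category` column is written back as plain text by `to_csv`, so the dtype would need to be set again after reloading the file.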

In [144]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40432 entries, 0 to 40431
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  40432 non-null  object
 1   rating    40432 non-null  int32 
 2   label     40432 non-null  object
 3   text_     40432 non-null  object
dtypes: int32(1), object(3)
memory usage: 1.1+ MB


>This output confirms the saving: the reported memory usage dropped from 1.2+ MB to 1.1+ MB

#### 4. Are there missing values?

In [145]:
df.isnull().sum()

category    0
rating      0
label       0
text_       0
dtype: int64
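One caveat: `isnull()` only counts `NaN`/`None`. A review that is an empty or whitespace-only string would still count as non-null. A quick sketch of that extra check follows, using a made-up Series in place of `df['text_']`.

```python
import pandas as pd

# Illustrative review texts, including an empty and a whitespace-only entry.
texts = pd.Series(["Love this!", "", "   ", "Very nice set."], name="text_")

# Missing values in the pandas sense (NaN / None).
n_null = int(texts.isnull().sum())

# Strings that are empty once surrounding whitespace is removed.
n_blank = int(texts.str.strip().eq("").sum())

print(n_null, n_blank)
```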

#### 5. How does the data look statistically?

In [146]:
df.describe()

| | rating |
|---|---|
| count | 40432.0 |
| mean | 4.256579 |
| std | 1.144354 |
| min | 1.0 |
| 25% | 4.0 |
| 50% | 5.0 |
| 75% | 5.0 |
| max | 5.0 |


#### 6. Are there duplicates in our dataset?

In [147]:
df.duplicated().sum()

12

>There are only 12 duplicate rows out of 40,432, so dropping them will have very little impact on the dataset
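It can help to look at the duplicated rows themselves when deciding what to do with them. `duplicated()` marks only the repeats by default, while `keep=False` marks every row in a duplicate group. The sketch below shows both on a tiny made-up frame; `reset_index(drop=True)` is an optional extra that renumbers the rows after dropping.

```python
import pandas as pd

# Illustrative frame: rows 0 and 1 are identical, row 3 differs only in label.
sample = pd.DataFrame({
    "rating": [5, 5, 1, 5],
    "label": ["CG", "CG", "OR", "OR"],
    "text_": ["Great", "Great", "Bad", "Great"],
})

# keep=False flags every member of a duplicate group, not just the repeats.
dupes = sample[sample.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each group.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(dupes)
print(len(sample), "->", len(deduped))
```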

In [148]:
df.drop_duplicates(inplace=True)

In [149]:
df.duplicated().sum()

0

#### 7. How are the columns correlated?

>We currently have only one numerical column (rating), so there is nothing to correlate it with yet. We will come back to this step after pre-processing, once the string columns have been converted to numbers
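As a rough preview of what that later step could look like, the sketch below encodes `label` as 0/1, derives a text-length feature, and calls `DataFrame.corr()`. The frame is made up, and the column names `is_cg` and `text_len` are hypothetical; the actual pre-processing may choose different encodings.

```python
import pandas as pd

# Illustrative rows only; not taken from the dataset.
sample = pd.DataFrame({
    "rating": [5, 1, 4, 2, 5, 3],
    "label": ["CG", "OR", "CG", "OR", "CG", "OR"],
    "text_": ["Love it", "Broke after a week of use", "Nice",
              "Too small for me", "Great", "It is okay"],
})

encoded = pd.DataFrame({
    "rating": sample["rating"],
    # Hypothetical encoding: 1 for computer-generated, 0 for original.
    "is_cg": (sample["label"] == "CG").astype(int),
    # Hypothetical feature: number of characters in the review.
    "text_len": sample["text_"].str.len(),
})

# Pairwise Pearson correlation between the numeric columns.
corr = encoded.corr()
print(corr)
```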

#### Saving all the changes to a file

In [150]:
df.to_csv("../Data/Pre-processed/exploration.csv", index=False)