# Facebook Data Crawling
After collecting the data, we process it so that it is easier to read and analyze. This means removing fields that are not needed, handling missing values, and reshaping columns so that simple visualizations of distributions and relationships are possible afterwards. This notebook performs that basic preprocessing on the data we collected from Facebook.

In [1]:
%pip install matplotlib pandas numpy seaborn wordcloud



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
raw_df = pd.read_csv('Data/VolleyballWorld.csv')


# Data Preprocessing

In [3]:
# Summary of the raw data in raw_df
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 52 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   post_id                        90 non-null     int64  
 1   text                           90 non-null     object 
 2   post_text                      90 non-null     object 
 3   shared_text                    0 non-null      float64
 4   original_text                  0 non-null      float64
 5   time                           90 non-null     object 
 6   timestamp                      90 non-null     int64  
 7   image                          6 non-null      object 
 8   image_lowquality               90 non-null     object 
 9   images                         90 non-null     object 
 10  images_description             90 non-null     object 
 11  images_lowquality              90 non-null     object 
 12  images_lowquality_description  90 non-null     objec

# Handling null data

In [4]:
# Drop the columns whose value is NaN in every row
raw_df = raw_df.dropna(axis='columns', how='all')
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 37 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   post_id                        90 non-null     int64  
 1   text                           90 non-null     object 
 2   post_text                      90 non-null     object 
 3   time                           90 non-null     object 
 4   timestamp                      90 non-null     int64  
 5   image                          6 non-null      object 
 6   image_lowquality               90 non-null     object 
 7   images                         90 non-null     object 
 8   images_description             90 non-null     object 
 9   images_lowquality              90 non-null     object 
 10  images_lowquality_description  90 non-null     object 
 11  video                          84 non-null     object 
 12  video_id                       84 non-null     float
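Before dropping anything, it can help to see which columns are entirely empty. The following is a minimal sketch on a made-up three-row frame (the `sample` data is hypothetical, only the column names echo the dataset); it lists the all-null columns and then applies the same `dropna` call used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw data; the values are made up for illustration
sample = pd.DataFrame({
    "post_id": [1, 2, 3],
    "shared_text": [np.nan, np.nan, np.nan],  # missing in every row
    "image": ["a.jpg", None, None],           # missing in some rows only
})

# Names of the columns in which every row is missing
all_null = [col for col in sample.columns if sample[col].isna().all()]

# Same call as in the notebook: only entirely empty columns are removed
trimmed = sample.dropna(axis="columns", how="all")
```

Partly filled columns such as `image` survive this step, which matches the `info()` output above where `image` still has 6 non-null values.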

In [5]:
# Normalize video_id, image_id and with into binary flags: 1 = present, 0 = absent.
# Each column is assigned as a whole, which avoids chained indexing
# (raw_df[col].loc[mask] = ...) and the SettingWithCopyWarning it can raise.
for col in ['video_id', 'image_id', 'with']:
    raw_df[col] = (raw_df[col].fillna(0) != 0).astype(int)
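Chained indexing of the form `df[col].loc[mask] = value` assigns to an intermediate object, so pandas may warn and the change may not reach the original frame. As a rough sketch on hypothetical data, two equivalent ways to build a 0/1 flag without chaining are shown here; the names `has_video` and `video_flag` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; in the real data the column holds an ID or NaN
sample = pd.DataFrame({"video_id": [123.0, np.nan, 456.0]})

# Option 1: build the flag in a single expression
sample["has_video"] = sample["video_id"].notna().astype(int)

# Option 2: one .loc call on the DataFrame itself, with rows and column together
sample["video_flag"] = 0
sample.loc[sample["video_id"].notna(), "video_flag"] = 1
```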


In [6]:
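# Drop duplicate columns, i.e. columns whose values match an earlier column in every row.
# Note: transposing twice converts every column to dtype object.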
raw_df = raw_df.T.drop_duplicates().T
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 35 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   post_id                        90 non-null     object
 1   text                           90 non-null     object
 2   time                           90 non-null     object
 3   timestamp                      90 non-null     object
 4   image                          6 non-null      object
 5   image_lowquality               90 non-null     object
 6   images                         90 non-null     object
 7   images_description             90 non-null     object
 8   images_lowquality              90 non-null     object
 9   images_lowquality_description  90 non-null     object
 10  video                          84 non-null     object
 11  video_id                       90 non-null     object
 12  video_thumbnail                84 non-null     object
 13  likes  
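The `info()` output above shows every column as `object` after the double transpose. One possible alternative, sketched here on made-up data, is to compute a duplicate mask on the transposed frame and use it to select columns from the original, which leaves the original dtypes untouched.

```python
import pandas as pd

# Hypothetical sample in which post_text repeats text, as in the raw data
sample = pd.DataFrame({
    "post_id": [1, 2, 3],
    "text": ["a", "b", "c"],
    "post_text": ["a", "b", "c"],
})

# Approach used in the notebook: duplicates are removed, dtypes become object
round_trip = sample.T.drop_duplicates().T

# Alternative: True for each column that repeats an earlier column
is_duplicate = sample.T.duplicated().to_numpy()
deduped = sample.loc[:, ~is_duplicate]
```

Both results keep `post_id` and `text`; only `deduped` keeps `post_id` as an integer column.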

# Dropping columns not used in the analysis

In [7]:
pd.options.display.max_colwidth = 100

In [7]:
raw_df.drop(columns=['images', 'images_lowquality', 'image_lowquality', 'images_description', 'images_lowquality_description'], inplace=True)
raw_df.drop(columns=['timestamp', 'image', 'video', 'video_thumbnail'], inplace=True)
raw_df.drop(columns=['link', 'links', 'user_id', 'username', 'user_url', 'is_live', 'available'], inplace=True)
raw_df.drop(columns=['likes', 'w3_fb_url', 'page_id', 'image_ids', 'image_id', 'fetched_time', 'post_url'], inplace=True)
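`drop` raises a `KeyError` when a listed column does not exist, for example when the cell is run a second time. A small sketch on hypothetical data shows two ways to make the step repeatable: passing `errors="ignore"`, or naming the columns to keep instead. The label `not_a_column` is deliberately absent from the sample.

```python
import pandas as pd

# Hypothetical sample with a few of the dataset's column names
sample = pd.DataFrame({
    "post_id": [1, 2],
    "text": ["a", "b"],
    "link": [None, None],
    "user_id": [10, 10],
})

# Illustrative drop list; "not_a_column" does not exist in the frame
to_drop = ["link", "user_id", "not_a_column"]

# errors="ignore" skips missing labels instead of raising KeyError
dropped = sample.drop(columns=to_drop, errors="ignore")

# Alternative: state the columns to keep
kept = sample[["post_id", "text"]]
```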

In [8]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   post_id         90 non-null     object
 1   text            90 non-null     object
 2   time            90 non-null     object
 3   video_id        90 non-null     object
 4   comments        90 non-null     object
 5   shares          90 non-null     object
 6   comments_full   90 non-null     object
 7   reactors        90 non-null     object
 8   reactions       90 non-null     object
 9   reaction_count  90 non-null     object
 10  with            90 non-null     object
 11  header          70 non-null     object
dtypes: object(12)
memory usage: 8.6+ KB
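All twelve columns are reported as `object` here, including ones that look like counts. If numeric operations are planned, they could be converted back with `pd.to_numeric`. The sketch below assumes that `comments` and `shares` hold plain integer counts, which the notebook does not confirm, and uses made-up values.

```python
import pandas as pd

# Hypothetical sample forced to object dtype to mimic the state above
sample = pd.DataFrame({
    "comments": [3, 10],
    "shares": [1, 0],
    "text": ["first post", "second post"],
}).astype(object)

# Assumed to contain integer counts; adjust the list to the real data
count_columns = ["comments", "shares"]
for column in count_columns:
    sample[column] = pd.to_numeric(sample[column])
```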


In [9]:
# Fill missing header values with a placeholder
raw_df.fillna({'header': "No header"}, inplace=True)
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   post_id         90 non-null     object
 1   text            90 non-null     object
 2   time            90 non-null     object
 3   video_id        90 non-null     object
 4   comments        90 non-null     object
 5   shares          90 non-null     object
 6   comments_full   90 non-null     object
 7   reactors        90 non-null     object
 8   reactions       90 non-null     object
 9   reaction_count  90 non-null     object
 10  with            90 non-null     object
 11  header          90 non-null     object
dtypes: object(12)
memory usage: 8.6+ KB


In [10]:
# Split the time column into a date and a time of day
parsed_time = pd.to_datetime(raw_df['time'])  # parse once, reuse for both columns
raw_df['date'] = parsed_time.dt.strftime('%Y-%m-%d')
raw_df['exact_time'] = parsed_time.dt.strftime('%H:%M:%S')
#raw_df.drop(columns= ['time'], inplace=True)
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   post_id         90 non-null     object
 1   text            90 non-null     object
 2   time            90 non-null     object
 3   video_id        90 non-null     object
 4   comments        90 non-null     object
 5   shares          90 non-null     object
 6   comments_full   90 non-null     object
 7   reactors        90 non-null     object
 8   reactions       90 non-null     object
 9   reaction_count  90 non-null     object
 10  with            90 non-null     object
 11  header          90 non-null     object
 12  date            90 non-null     object
 13  exact_time      90 non-null     object
dtypes: object(14)
memory usage: 10.0+ KB
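The `date` and `exact_time` columns are stored as strings. For later grouping it may be more convenient to derive numeric or categorical features from the parsed datetime. The sketch below uses two made-up timestamps; the column names `hour` and `weekday` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical timestamps in the same general shape as the time column
sample = pd.DataFrame({"time": ["2023-10-01 18:30:05", "2023-10-02 07:15:00"]})

posted_at = pd.to_datetime(sample["time"])
sample["hour"] = posted_at.dt.hour            # integer hour of day
sample["weekday"] = posted_at.dt.day_name()   # English weekday name

# Number of posts per weekday
posts_per_weekday = sample.groupby("weekday").size()
```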


In [11]:
ss = raw_df['reactions']

In [12]:
df = pd.DataFrame({
    'like': [],
    'love': [],
    'haha': [],
    'care': [],
    'wow': [],
    'sad': [],
    'angry': []
})

In [13]:
# Column order: like, love, haha, care, wow, sad, angry
PREFIX_TO_COLUMN = {"lik": 0, "lov": 1, "hah": 2, "car": 3, "wow": 4, "sad": 5, "ang": 6}

for s in ss:
    # Strip the braces from the dict-like string
    s = s.replace("{", "").replace("}", "")
    new_row = np.zeros(7, dtype=np.int32)
    # `pair` replaces the original name `list`, which shadowed the built-in
    for pair in s.split(","):
        sub_list = pair.split(":")
        name = sub_list[0].replace("'", "").strip()
        # Match on the first three characters of the reaction name
        column = PREFIX_TO_COLUMN.get(name[0:3])
        if column is not None:
            new_row[column] = int(sub_list[1])
    # Append the counts for this post as a new row
    df.loc[len(df.index)] = new_row
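The loop above parses each value by hand. If every `reactions` value is the string form of a Python dict with the reaction names as keys (an assumption; the loop only matches on three-letter prefixes, so the exact keys may differ), the same table could be built with `ast.literal_eval`. A minimal sketch on made-up strings:

```python
import ast
import pandas as pd

REACTION_COLUMNS = ["like", "love", "haha", "care", "wow", "sad", "angry"]

def parse_reactions(value):
    """Turn one dict-like string into a dict; empty or non-string input gives {}."""
    if not isinstance(value, str) or not value.strip():
        return {}
    return ast.literal_eval(value)

# Hypothetical sample values, not taken from the dataset
sample = pd.Series(["{'like': 5, 'love': 2}", "{'like': 1, 'haha': 3}", "{}"])

parsed = sample.map(parse_reactions)
reactions_df = (
    pd.DataFrame(parsed.tolist(), index=sample.index)
    .reindex(columns=REACTION_COLUMNS)  # fixed column order, absent reactions added
    .fillna(0)                          # a reaction that is not listed counts as 0
    .astype(int)
)
```

Unlike the prefix matching in the loop, this version silently drops keys that are not in `REACTION_COLUMNS`, so the key names should be checked against the real data first.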
    
    

In [14]:
df

      like  love  haha  care  wow  sad  angry
0       48    14     1     0    1    0      0
1      169    50     0     1    0    0      0
2      290    64     0     1    0    0      0
3     1499  1169     1    20    5    0      0
4      267    63     0     3    1    0      0
...    ...   ...   ...   ...  ...  ...    ...
85    1338   210     1     5    6    0      1
86    3784    87  3547    23   38   17      0
87    2130   894     2    14   10    0      0
88    1138   140     0     8    3    0      0


In [15]:
clean_df = raw_df.drop(columns=['reactions']).join(df)
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   post_id         90 non-null     object
 1   text            90 non-null     object
 2   time            90 non-null     object
 3   video_id        90 non-null     object
 4   comments        90 non-null     object
 5   shares          90 non-null     object
 6   comments_full   90 non-null     object
 7   reactors        90 non-null     object
 8   reaction_count  90 non-null     object
 9   with            90 non-null     object
 10  header          90 non-null     object
 11  date            90 non-null     object
 12  exact_time      90 non-null     object
 13  like            90 non-null     int32 
 14  love            90 non-null     int32 
 15  haha            90 non-null     int32 
 16  care            90 non-null     int32 
 17  wow             90 non-null     int32 
 18  sad         
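`join` matches rows by index label, not by position, so the step above relies on both frames sharing the same default index from 0 to 89. A small sketch with hypothetical frames makes that check explicit before joining:

```python
import pandas as pd

# Hypothetical stand-ins for raw_df and the reaction counts frame
posts = pd.DataFrame({"post_id": [11, 12, 13], "reactions": ["{}", "{}", "{}"]})
counts = pd.DataFrame({"like": [5, 1, 0], "love": [2, 0, 0]})

# Rows are paired by index label, so the two indexes should be identical
assert posts.index.equals(counts.index)

combined = posts.drop(columns=["reactions"]).join(counts)
```

If rows had been filtered or reordered in one frame only, the indexes would no longer line up and the join would pair counts with the wrong posts or produce missing values.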

In [18]:
# Save the cleaned data to a CSV file
clean_df.to_csv("Data/clean_data.csv", index=False)
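A CSV file does not store dtypes, so they have to be restated when the file is read back. The sketch below round-trips a made-up frame through an in-memory buffer, which stands in for `Data/clean_data.csv`; treating `post_id` as text and `date` as a datetime is an assumption about how the cleaned data will be used.

```python
import io
import pandas as pd

# Hypothetical sample with a few of the cleaned columns
sample = pd.DataFrame({
    "post_id": ["001", "002"],
    "date": ["2023-10-01", "2023-10-02"],
    "like": [48, 169],
})

buffer = io.StringIO()  # stands in for a file on disk
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Restate the intended types while reading
restored = pd.read_csv(buffer, dtype={"post_id": str}, parse_dates=["date"])
```

Writing with `index=False`, as the notebook does, also keeps the row index from coming back as an extra unnamed column.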