# Clean & Analyse Social Media

This project analyses social media usage data to produce a comprehensive report on how different categories of social media posts perform.

## Scenario

Suppose we work for a social media marketing company that specializes in promoting brands and products on a popular social media platform. Our team is responsible for analyzing the performance of different types of posts by category (health, family, food, and so on) to help clients optimize their social media strategy and increase their reach and engagement.

The task is to use Python to automatically extract tweets from one or more categories, and then clean, analyze and visualize the data. The team will use our analysis to make data-driven recommendations that help clients improve their social media performance. This will help the marketing agency deliver tweets on time, within budget, and with fast results.

## Objectives

The project objectives are:
1. Increase client reach and engagement
2. Gain valuable insights that will help improve social media performance
3. Provide data-driven recommendations that help clients achieve their social media goals

## 1. Importing Required Libraries

The first step in any data or machine learning project is to import the required libraries. For this project, we'll need:
- `pandas` for data manipulation and analysis
- `numpy` for numerical computations
- `matplotlib` for data visualization
- `seaborn` for statistical data visualization
- `random` for generating random numbers

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random

## 2. Generating Random Social Media Data

After importing the required libraries, let's generate some random tweet data to analyze. To do this, we use pandas' `date_range` to build a sequence of consecutive daily dates, the `random` module's `choice` to pick a random category from a list, and NumPy's `random.randint` to draw random integers for the number of likes. But first we need to define the list of categories for the social media experiment.

In [2]:
# List of categories
categories = ['food', 'travel', 'fashion', 'fitness', 'music', 'culture', 'family', 'health']

# Generate random data
data = {
    'Date': pd.date_range('2022-01-02', periods=500),  # 500 consecutive daily dates starting on 2022-01-02
    'Category': [random.choice(categories) for _ in range(500)],  # 500 categories picked at random from the list
    'Likes': np.random.randint(0, 10000, size=500)  # 500 random integers in [0, 10000)
}
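
The data above changes on every run because neither `random` nor NumPy is seeded. If reproducible results are wanted, one option is to seed both generators first. The sketch below is an assumption rather than part of the original notebook: the helper name `make_data` and the seed value `42` are hypothetical.

```python
import random

import numpy as np
import pandas as pd

categories = ['food', 'travel', 'fashion', 'fitness', 'music', 'culture', 'family', 'health']


def make_data(n=500, seed=42):
    """Build the same kind of random data dict, but reproducibly (hypothetical helper)."""
    rng = random.Random(seed)             # seeded stdlib generator for the categories
    np_rng = np.random.default_rng(seed)  # seeded NumPy generator for the likes
    return {
        'Date': pd.date_range('2022-01-02', periods=n),          # n consecutive daily dates
        'Category': [rng.choice(categories) for _ in range(n)],  # n random categories
        'Likes': np_rng.integers(0, 10000, size=n),              # n integers in [0, 10000)
    }


first = pd.DataFrame(make_data())
second = pd.DataFrame(make_data())
print(first.equals(second))  # checks that the same seed yields the same data
```

Using local generator objects instead of the global `random.seed` / `np.random.seed` keeps the seeding from affecting other code in the session.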

## 3. Loading the Data into a Pandas DataFrame and Exploring it

Now, let's load our randomly generated data into a pandas dataframe and explore it. To do so:
1. We use the pandas `DataFrame()` constructor, to which we pass our data.
2. We display the first rows with `pandas.DataFrame.head()`, which returns the first n rows of the object based on position (n=5 by default).
3. We count the occurrences of each `Category` value with `pandas.Series.value_counts()`, which returns a Series containing the frequency of each distinct value in the column.
4. We generate descriptive statistics that summarize the central tendency, dispersion and shape of the dataset's distribution, excluding NaN values. Here, we use `pandas.DataFrame.describe()`, which reports:
    - count: The number of non-NA/null observations.
    - mean: The mean of the values.
    - min: The minimum of the values in the object.
    - 25%: The 25th percentile (first quartile).
    - 50%: The 50th percentile, which is the same as the median.
    - 75%: The 75th percentile (third quartile).
    - max: The maximum of the values in the object.
    - std: The standard deviation of the observations.
5. We print a concise summary of the dataframe with `pandas.DataFrame.info()`. This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

Along the way, we also check the dataframe's shape with `df.shape` and count the missing values per column with `df.isna().sum()`.

In [3]:
# Load data into a pandas dataframe
df = pd.DataFrame(data)

# Head of the dataframe
df.head()

|   | Date       | Category | Likes |
|---|------------|----------|-------|
| 0 | 2022-01-02 | health   | 3737  |
| 1 | 2022-01-03 | music    | 2975  |
| 2 | 2022-01-04 | family   | 2380  |
| 3 | 2022-01-05 | food     | 4975  |
| 4 | 2022-01-06 | fashion  | 2901  |


In [4]:
# Shape of the dataframe
print(f"The shape of the dataframe : {df.shape}")

The shape of the dataframe : (500, 3)


In [5]:
# Count of each `Category` element
df['Category'].value_counts()

Category
health     74
food       71
fitness    69
fashion    67
travel     59
music      58
culture    52
family     50
Name: count, dtype: int64
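
Raw counts can be hard to compare at a glance. As a small sketch, `value_counts(normalize=True)` returns each category's share of posts instead of its count. The toy Series below is made up for illustration and is not the notebook's data.

```python
import pandas as pd

# Toy data for illustration only (not the notebook's random sample)
toy = pd.Series(['food', 'food', 'travel', 'health'], name='Category')

counts = toy.value_counts()                # absolute frequency of each unique value
shares = toy.value_counts(normalize=True)  # relative frequency; the shares sum to 1

print(counts)
print(shares)
```

On the notebook's dataframe the equivalent call would be `df['Category'].value_counts(normalize=True)`.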

In [6]:
# Descriptive statistics
df.describe()

|       | Date                | Likes       |
|-------|---------------------|-------------|
| count | 500                 | 500.0       |
| mean  | 2022-09-08 12:00:00 | 4961.836    |
| min   | 2022-01-02 00:00:00 | 41.0        |
| 25%   | 2022-05-06 18:00:00 | 2618.5      |
| 50%   | 2022-09-08 12:00:00 | 5180.5      |
| 75%   | 2023-01-11 06:00:00 | 7463.25     |
| max   | 2023-05-16 00:00:00 | 9989.0      |
| std   |                     | 2880.269088 |
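
The percentile rows of `describe()` can also be reproduced directly with `quantile()`, which may be handy when only a few statistics are needed. The sketch below uses a small made-up `Likes` column, so its numbers differ from the output above.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({'Likes': [100, 200, 300, 400, 500]})

summary = toy['Likes'].describe()                     # count, mean, std, min, 25%, 50%, 75%, max
quartiles = toy['Likes'].quantile([0.25, 0.5, 0.75])  # the same three percentiles, computed directly

print(summary)
print(quartiles)
```

Both calls use linear interpolation between data points by default, which is why the `50%` row matches the median.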


In [7]:
# Concise summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      500 non-null    datetime64[ns]
 1   Category  500 non-null    object        
 2   Likes     500 non-null    int64         
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 11.8+ KB
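
The `+` in `11.8+ KB` signals that the figure is a lower bound: for `object` columns, pandas by default counts only the references and not the strings themselves. The sketch below shows how one might measure the full size with `deep=True`. It also stores the repeated labels as a `category` dtype, which is a suggestion and not something the notebook does. The toy column is made up for illustration.

```python
import pandas as pd

# Toy column with many repeated labels, for illustration only.
# dtype=object is set explicitly to mirror the `object` dtype shown by df.info().
toy = pd.Series(['food', 'travel', 'health', 'music'] * 250, dtype=object, name='Category')

shallow = toy.memory_usage()        # default: does not inspect the string objects
deep = toy.memory_usage(deep=True)  # also counts the memory used by the strings

# A categorical column stores small integer codes plus one copy of each distinct label
as_category = toy.astype('category')
deep_category = as_category.memory_usage(deep=True)

print(shallow, deep, deep_category)
```

For a whole dataframe, `df.info(memory_usage='deep')` reports the deep figure in the summary line.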


In [8]:
# Missing values
print(f"The number of missing values (na) in each column :\n{df.isna().sum()}")

The number of missing values (na) in each column :
Date        0
Category    0
Likes       0
dtype: int64
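
All counts are zero here because the data was generated without gaps. To illustrate what the same check reports on incomplete data, the sketch below builds a small made-up frame with deliberate gaps. `dropna()` is shown as one common cleaning option, not necessarily the one this project uses.

```python
import numpy as np
import pandas as pd

# Made-up frame with deliberate gaps, for illustration only
toy = pd.DataFrame({
    'Category': ['food', None, 'health', 'music'],
    'Likes': [120, 340, np.nan, 560],
})

missing_per_column = toy.isna().sum()  # number of missing values in each column
cleaned = toy.dropna()                 # drop every row that has at least one missing value

print(missing_per_column)
print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```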
