# Technical Report

## Introduction

This report explores Instagram advertising metrics to offer actionable insights for investors. Analyzing data from over 730 brands, we aim to identify trends and anomalies to inform strategic investment decisions. Through rigorous data analysis and modeling, we provide a foundation for informed decision-making in the investment landscape.

## Data Processing (data_processing.ipynb):
In this section, we load the data and examine the structure of the dataset and what each feature represents.

### Summary of Categorical Dataset Structure:

The dataset comprises categorical data across several columns, including `period_end_date`, `compset_group`, `compset`, `business_entity_doing_business_as_name`, `legal_entity_name`, `domicile_country_name`, `ultimate_parent_legal_entity_name`, and `primary_exchange_name`. Here's a brief summary of the key insights:

- The dataset contains 704,313 rows. The `period` column, which indicates weekly periods, is redundant and can be dropped.
- `period_end_date` has 455 unique values, representing the end dates of the weekly periods.
- `compset_group` includes 20 unique categories, while `compset` contains 54 unique values.
- `business_entity_doing_business_as_name` and `legal_entity_name` have 706 and 423 unique values, respectively.
- `domicile_country_name` lists 26 unique countries, with `ultimate_parent_legal_entity_name` containing 401 unique values.
- `primary_exchange_name` includes 30 unique exchanges.
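
Counts like the ones above can be reproduced with a few lines of pandas. The following is a minimal sketch, not the notebook's actual code: the helper name `summarize_categoricals` is hypothetical, and the toy rows (including the "Beverages" sector) are illustrative stand-ins for the real dataset.

```python
import pandas as pd


def summarize_categoricals(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return unique-value and missing-value counts for each listed column."""
    return pd.DataFrame(
        {
            "n_unique": df[columns].nunique(dropna=True),
            "n_missing": df[columns].isna().sum(),
        }
    )


# Toy rows standing in for the real dataset (values are illustrative only).
sample = pd.DataFrame(
    {
        "compset_group": ["Apparel Retail", "Apparel Retail", "Beverages"],
        "domicile_country_name": ["United States", None, "Italy"],
    }
)

summary = summarize_categoricals(sample, ["compset_group", "domicile_country_name"])
print(summary)
```

Reporting missing counts next to unique counts makes it easier to see which categorical columns will need attention during cleaning.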

### Summary of Numerical Dataset Structure:
The `followers`, `pictures`, `videos`, `likes`, and `comments` columns contain numerical data, indicating various metrics related to Instagram activity. These columns require careful interpretation, considering factors such as post deletions and likes/comments on posts from previous weeks.
  
#### Numerical Data Analysis:

The numerical columns in the dataset provide insights into Instagram activity metrics for the brands under consideration. Here's a detailed analysis of each numerical feature:

1. **Followers:**
   - The `followers` column indicates the total number of followers a company had in a specific week.
   - The distribution of followers across brands can reveal trends in audience engagement and brand popularity over time.
   - Anomalies or sudden spikes in follower count may indicate significant events or marketing campaigns.

2. **Pictures and Videos:**
   - The `pictures` and `videos` columns represent the number of new pictures and videos posted on Instagram by each brand within a week.
   - These metrics reflect the brand's content creation and engagement strategies.
   - A higher number of new posts may indicate increased activity and content diversity, potentially leading to higher engagement.

3. **Likes and Comments:**
   - The `likes` and `comments` columns denote the number of new likes and comments gained by a brand's posts on Instagram within a week.
   - These metrics gauge audience interaction and sentiment towards the brand's content.
   - Analysis of likes and comments trends can highlight content effectiveness and audience engagement patterns.
   - Anomalies in likes and comments metrics may signal viral content or influencer collaborations, impacting brand visibility and perception.

4. **Interpretation Considerations:**
   - It's important to consider factors such as post deletions and likes/comments on posts from previous weeks, as they may influence the interpretation of numerical metrics.
   - Outliers and sudden fluctuations in numerical data should be investigated further to understand their underlying causes, whether they result from genuine trends or data anomalies.

By conducting comprehensive analysis and visualization of numerical metrics, we can gain valuable insights into brand performance, audience engagement dynamics, and the effectiveness of Instagram marketing strategies. These insights are instrumental in informing investment decisions and identifying opportunities for strategic partnerships and growth.

## Data Cleaning (data_cleaning.ipynb):

The dataset underwent a comprehensive cleaning process, which can be categorized into two main parts:

### Cleaning the Dataset (Categorical Data):

#### Dealing with Missing Values:
- **Identification of Missing Values**: The dataset was inspected for missing values, particularly focusing on columns such as 'domicile_country_name' and 'primary_exchange_name', which exhibited substantial missingness.
- **Strategies for Handling Missing Values**: Options considered included dropping columns with significant missing values, imputing missing values based on available data, and replacing missing values with 'Unknown'.
- **Correlation Analysis**: A nullity correlation heatmap and dendrogram showed that missingness in certain columns, such as 'ultimate_parent_legal_entity_name' and 'legal_entity_name', is highly correlated: when one value is missing, the other tends to be missing as well.
- **Decision-Making Process**: Decisions were made to either drop redundant columns or impute missing values based on data patterns and relevance to the analysis.
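
The numbers behind a nullity correlation heatmap can be computed with plain pandas. This is a sketch under stated assumptions: the function name `nullity_correlation` is hypothetical, the sample values are placeholders, and the notebook may have used a dedicated plotting library instead.

```python
import pandas as pd


def nullity_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Correlate per-column missingness indicators (1 = missing, 0 = present)."""
    indicators = df.isna().astype(int)
    # Columns that are never or always missing have zero variance, so drop them.
    varying = indicators.loc[:, indicators.nunique() > 1]
    return varying.corr()


# Placeholder rows; only the pattern of missing values matters here.
sample = pd.DataFrame(
    {
        "legal_entity_name": ["A", None, "C", None],
        "ultimate_parent_legal_entity_name": ["P", None, "Q", None],
        "primary_exchange_name": [None, "X", "Y", "Z"],
    }
)

missing_share = sample.isna().mean()   # fraction of missing values per column
nullity = nullity_correlation(sample)  # pairwise correlation of missingness
print(missing_share)
print(nullity)
```

A value near 1 between two columns means they are usually missing together, which is the pattern described for the two entity-name columns.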

#### Dealing with Data Contaminations:
- **Duplicate Row Identification**: Duplicate rows were identified and addressed by replacing NaN values with appropriate values.
- **Correction of Erroneous Values**: Erroneous values such as incorrect country names and concatenated entries were corrected to maintain data integrity.
- **Identification of Significant Findings**: The identification of 'All Brands' entries and their replacement with 'No Parents' highlighted a crucial aspect for model bias mitigation. It's imperative to remove 'All Brands' entries from the dataset before training AI models to prevent bias in the model's predictions.

### Cleaning the Dataset (Numerical Data):

#### Dealing with Missing Values:
- **Imputation Strategy Decision**: The missing value patterns in numerical columns like 'followers', 'pictures', 'likes', and 'videos' were analyzed to determine suitable imputation strategies.
- **Selection of Imputation Methods**: Linear interpolation, linear regression, and KNN imputation were chosen based on the linearity of data distributions and relationships between columns.
- **Handling Extreme Cases**: Cases with excessive missing values in a row were addressed using mean imputation to ensure completeness in the dataset.
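
As an illustration of one of these strategies, the sketch below applies linear interpolation within each brand's time series and falls back to the column mean where a brand has no observed values at all. It is a simplified stand-in rather than the notebook's implementation: the function name `interpolate_per_brand` and the sample rows are hypothetical, and the regression and KNN steps are not shown.

```python
import numpy as np
import pandas as pd


def interpolate_per_brand(
    df: pd.DataFrame, brand_col: str, date_col: str, value_cols: list[str]
) -> pd.DataFrame:
    """Linearly interpolate each brand's series, then mean-fill what remains."""
    out = df.sort_values([brand_col, date_col]).copy()
    for col in value_cols:
        # Interpolate inside each brand so values never leak across brands.
        out[col] = out.groupby(brand_col)[col].transform(
            lambda s: s.interpolate(method="linear", limit_direction="both")
        )
        # Brands with no observations at all fall back to the column mean.
        out[col] = out[col].fillna(out[col].mean())
    return out


# Hypothetical rows: brand "A" has one gap, brand "B" has no data at all.
sample = pd.DataFrame(
    {
        "business_entity_doing_business_as_name": ["A", "A", "A", "B", "B"],
        "period_end_date": pd.to_datetime(
            ["2020-01-04", "2020-01-11", "2020-01-18", "2020-01-04", "2020-01-11"]
        ),
        "followers": [100.0, np.nan, 300.0, np.nan, np.nan],
    }
)

filled = interpolate_per_brand(
    sample, "business_entity_doing_business_as_name", "period_end_date", ["followers"]
)
print(filled)
```

Grouping by brand before interpolating matters because consecutive rows from different brands are unrelated, so interpolating across them would invent values.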

The cleaning process resulted in a refined dataset ready for further analysis, including Exploratory Data Analysis (EDA) and statistical modeling.

## Sector Trends Analysis (Sector_Trends_moein.ipynb):

### Extra Statistics
In exploring the dataset, we sought to derive more generalized metrics by combining various individual metrics. For instance, we introduced metrics like "likes per picture" to provide broader insights into Instagram engagement.
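
A derived metric such as "likes per picture" could be computed as in the sketch below. The helper name `add_ratio_metric` and the new column name `likes_per_picture` are assumptions made for illustration; weeks with zero pictures are left as missing rather than producing a division by zero.

```python
import numpy as np
import pandas as pd


def add_ratio_metric(
    df: pd.DataFrame, numerator: str, denominator: str, name: str
) -> pd.DataFrame:
    """Add a ratio column, leaving NaN where the denominator is zero."""
    out = df.copy()
    safe_denominator = out[denominator].replace(0, np.nan)
    out[name] = out[numerator] / safe_denominator
    return out


# Illustrative values only.
sample = pd.DataFrame({"likes": [100, 50], "pictures": [4, 0]})
with_ratio = add_ratio_metric(sample, "likes", "pictures", "likes_per_picture")
print(with_ratio)
```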

### All Brands Sector
We aggregated data from all brands to create an overarching sector known as "All Brands."

### Sector Statistics
We conducted an in-depth analysis of sectors to identify significant disparities in metrics such as likes, comments, etc. across different industry segments. Our findings revealed that certain sectors exhibit exceptional performance on social media platforms like Instagram. We concluded that focusing on these top-performing sectors is essential for leveraging Instagram data for investment purposes, while other sectors may not offer as much utility in the dataset.
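
One way to compare sectors is to aggregate each metric by `compset_group`. The sketch below assumes that `compset_group` holds the sector label; the function name `sector_summary`, the choice of mean and median, and the sample rows (including "Sector B") are illustrative rather than taken from the notebook.

```python
import pandas as pd


def sector_summary(
    df: pd.DataFrame,
    sector_col: str = "compset_group",
    metrics: tuple[str, ...] = ("likes", "comments"),
) -> pd.DataFrame:
    """Mean and median of each metric per sector, sorted by the first metric's mean."""
    summary = df.groupby(sector_col)[list(metrics)].agg(["mean", "median"])
    return summary.sort_values((metrics[0], "mean"), ascending=False)


# Illustrative rows only; "Sector B" is a placeholder name.
sample = pd.DataFrame(
    {
        "compset_group": ["Apparel Retail", "Apparel Retail", "Sector B"],
        "likes": [100, 200, 30],
        "comments": [10, 20, 3],
    }
)

summary = sector_summary(sample)
print(summary)
```

Showing the median alongside the mean helps when a few very large brands dominate a sector's average.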

### Digging Deeper into Sectors and Ranking Companies
We delved deeper into specific sectors, such as 'Apparel Retail,' to rank companies based on key metrics like likes. Using a sorted list of companies with notable metrics, we visualized the performance trends of individual companies over time. This analysis facilitates a deeper understanding of sector-specific dynamics and enables the identification of top-performing companies within each sector.
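
The ranking step can be sketched as a filter followed by a grouped sum. This assumes that sectors are stored in `compset_group` and brands in `business_entity_doing_business_as_name`; the function name `rank_companies`, the use of total likes as the ranking key, and the brand names are hypothetical.

```python
import pandas as pd


def rank_companies(
    df: pd.DataFrame, sector: str, metric: str = "likes", top_n: int = 5
) -> pd.Series:
    """Rank brands in one sector by their total for the chosen metric."""
    in_sector = df[df["compset_group"] == sector]
    totals = in_sector.groupby("business_entity_doing_business_as_name")[metric].sum()
    return totals.sort_values(ascending=False).head(top_n)


# Placeholder brands and values for illustration.
sample = pd.DataFrame(
    {
        "compset_group": ["Apparel Retail"] * 4 + ["Sector B"],
        "business_entity_doing_business_as_name": [
            "Brand A", "Brand A", "Brand B", "Brand C", "Brand D",
        ],
        "likes": [10, 20, 50, 5, 1000],
    }
)

top_apparel = rank_companies(sample, "Apparel Retail", top_n=2)
print(top_apparel)
```

The resulting index gives the brands whose time series are worth plotting individually.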


## EDA Report (EDA.ipynb): Exploring Instagram Data and Stock Market Trends

### Introduction
In this exploratory data analysis (EDA), we delve into Instagram data from various companies, aiming to uncover insights into social media engagement trends and their potential correlation with stock market performance. The dataset includes information such as followers, likes, comments, and stock prices, among others.

### Data Exploration and Cleaning
- Loaded the dataset from a CSV file and dropped redundant columns (`period` and `calculation_type`).
- Sorted the data by `period_end_date` to facilitate temporal analysis.
- Conducted descriptive statistics to gain initial insights into the dataset.
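
The loading steps above might look like the following sketch. To keep it self-contained, it reads from an in-memory string instead of the real CSV file, and the values shown for `period` and `calculation_type` are placeholders, not actual entries from the dataset.

```python
import io

import pandas as pd

# Stand-in for the real CSV file; the cell values are placeholders.
csv_text = """period,calculation_type,period_end_date,followers
Weekly,Metric Value,2020-01-11,120
Weekly,Metric Value,2020-01-04,100
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["period_end_date"])

# Drop the redundant columns and order rows chronologically.
df = (
    df.drop(columns=["period", "calculation_type"])
    .sort_values("period_end_date")
    .reset_index(drop=True)
)

print(df.describe())
```

Parsing `period_end_date` as a date at load time ensures that sorting is chronological rather than alphabetical.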

### Understanding Instagram Metrics
- Analyzed data for specific brands like Versace, GAF, and John John Denim to understand the nature of metrics such as followers, likes, comments, pictures, and videos.
- The open question was whether these values are weekly changes (for example, new likes gained in a week) or cumulative totals on the brand's page.
- We assumed that these metrics represent weekly changes in likes, comments, etc., rather than cumulative totals.
  - This assumption was based on a specific example in which a brand (GAF) had a massive spike in its likes, going from 5k likes to 100k likes for 3 weeks, then moving back down to 5k likes.
  - We found that the spike occurred because the brand collaborated with a celebrity, who also posted the videos on his Instagram page.
  - Because a cumulative total would not be expected to fall back to its earlier level, we concluded that the most likely explanation was that the data records weekly changes in likes, comments, etc.
- Investigated correlations between different metrics, revealing insights such as the positive correlation between likes and pictures.
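
The reasoning about weekly changes versus cumulative totals can be expressed as a simple check: a running total should never decrease, so a series that drops sharply is unlikely to be cumulative. The sketch below uses round numbers that mirror the spike described above, not actual rows from the dataset, and the function name `could_be_cumulative` is hypothetical. Post deletions could lower a true cumulative total slightly, so the check is a heuristic rather than a proof.

```python
import pandas as pd


def could_be_cumulative(values: pd.Series) -> bool:
    """A running total never decreases, so any drop argues against that reading."""
    return bool(values.is_monotonic_increasing)


# Round numbers mirroring the described spike; not actual dataset rows.
weekly_likes = pd.Series([5_000, 100_000, 100_000, 100_000, 5_000])

print(could_be_cumulative(weekly_likes))           # False
print(could_be_cumulative(weekly_likes.cumsum()))  # True
```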

### Sector Trends and Company Ranking
- Explored sector-wise trends to identify sectors with significant social media engagement.
- Ranked companies within sectors based on key metrics like likes, comments, etc., to identify top performers.

### Correlation Analysis with Stock Market Data
- Integrated Instagram data with stock market data for companies like Nike to explore potential correlations.
- Analyzed correlations between Instagram metrics and stock returns over different time periods.
- Explored delayed impacts on stock values by shifting stock data for specific weeks and examining correlations.
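
The shifted-correlation idea can be sketched as follows: for each lag, an Instagram metric in a given week is correlated with the stock return observed that many weeks later. The function name `lagged_correlations`, the maximum lag of four weeks, and the synthetic series are assumptions for illustration; the synthetic returns are constructed to echo the metric two weeks later so that the expected peak is known.

```python
import pandas as pd


def lagged_correlations(
    metric: pd.Series, returns: pd.Series, max_lag: int = 4
) -> pd.Series:
    """Correlate this week's metric with returns observed `lag` weeks later."""
    return pd.Series(
        {lag: metric.corr(returns.shift(-lag)) for lag in range(max_lag + 1)},
        name="correlation",
    )


# Synthetic data: returns copy the metric with a two-week delay.
metric = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 3.0, 7.0])
returns = metric.shift(2)

correlations = lagged_correlations(metric, returns)
print(correlations)
```

On real data the correlations are much noisier, and with few weekly observations a high value at a single lag can easily arise by chance.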

### Conclusion
- The analysis provides valuable insights into social media engagement patterns on Instagram across various companies and sectors.
- While some correlations between Instagram metrics and stock market performance were observed, further analysis is needed to validate these findings and explore causality.
- Especially when looking at individual brands, there was no clear correlation between Instagram performance and the stock market, because Instagram performance contains so much random noise (as expected).
- The findings highlight that Instagram data may not be the best predictor of stock market performance on its own, but it could be used in combination with other data sources to make more informed decisions about a brand's stock.

### Recommendations
- Further research should focus on refining correlation analysis techniques and exploring causal relationships between Instagram engagement metrics and stock market performance.
- Continuous monitoring of social media trends and stock market fluctuations is essential for informed decision-making and investment strategies.