<h1 style="color:#9370DB;"> Stock Analysis </h1>

In [1]:
# 📚 Libraries 
import kagglehub
import pandas as pd
import numpy as np
import os

# 📊 Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

### The Stock Analysis Dataset:


**First impressions:**
    
_____________

The **S&P 500** is a stock market index tracking the performance of the largest 500 publicly traded companies listed on U.S. stock exchanges.

Investors have long used the S&P 500 as a benchmark for their investments as it tends to signal overall market health. 
The index is a popular choice for long-term investors who wish to watch growth over the coming decades.

The dataset contains: 
- S&P 500 **Index**: Contains the daily price of the index, representing the overall performance of the 500 companies in the S&P 500.
- S&P 500 **Stocks**: Includes the daily stock prices for each company within the index, providing insights into individual stock movements. 
- S&P 500 **Companies**: Provides detailed information about each company, including metrics such as Name, Sector, Marketcap, Ebitda, Weight.

The data types are evenly split: 13 numeric columns (int or float) and 13 object columns.

Our **project goal** is to identify the performance of various sectors in the S&P 500. After reading the [documentation](https://www.kaggle.com/datasets/andrewmvd/sp-500-stocks) we will proceed with the following **strategy**:

1. The **target** of our dataset will be `currentprice`, which is the current price of the stock.
2. Through **Exploratory Data Analysis** we will identify the performance of various sectors and stocks. 
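
As a rough illustration of the strategy above, sector performance can be summarised with a `groupby`. This is only a sketch: the `companies` frame and its values are invented, and the snake_case column names (`sector`, `currentprice`, `marketcap`) are assumed from the dataset description rather than taken from loaded data.

```python
import pandas as pd

# Toy stand-in for the companies table; all values are invented for illustration.
companies = pd.DataFrame({
    "sector": ["Technology", "Technology", "Energy", "Energy"],
    "currentprice": [100.0, 200.0, 50.0, 70.0],
    "marketcap": [1000.0, 3000.0, 400.0, 600.0],
})

# Mean current price and total market cap per sector, largest sectors first.
sector_summary = (
    companies.groupby("sector")
    .agg(mean_price=("currentprice", "mean"), total_marketcap=("marketcap", "sum"))
    .sort_values("total_marketcap", ascending=False)
)
print(sector_summary)
```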

<h2 style="color: #9370DB;"> 01 | Data Extraction </h2>

In [2]:
data = pd.read_csv('social_media_entertainment_data.csv')

In [3]:
# Cleaning columns with snake_case 
data.columns = [col.lower().replace(" ", "_") for col in data.columns]
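
The cleaning step above only replaces spaces, so characters such as `(`, `)`, `/` and `-` stay in the column names. A stricter variant could look like the sketch below. The helper `to_snake_case` is hypothetical and is not what the notebook runs, so the outputs shown in this notebook keep the simpler names.

```python
import re

def to_snake_case(name: str) -> str:
    """Lowercase a column name and collapse each run of non-alphanumeric
    characters into a single underscore."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip().lower())
    return cleaned.strip("_")

print(to_snake_case("Daily Social Media Time (hrs)"))
print(to_snake_case("Work/Study Time (hrs)"))
```

It could be applied with `data.columns = [to_snake_case(col) for col in data.columns]`.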

<h3 style="color: #4169E1;">1.1 | Exploring the Data </h3>

In [4]:
data

Unnamed: 0,user_id,age,gender,country,daily_social_media_time_(hrs),daily_entertainment_time_(hrs),social_media_platforms_used,primary_platform,daily_messaging_time_(hrs),daily_video_content_time_(hrs),...,ad_interaction_count,time_on_educational_platforms_(hrs),parental_status,tech_savviness_level_(scale_1-10),preferred_device_for_entertainment,data_plan_used,digital_well-being_awareness,sleep_quality_(scale_1-10),social_isolation_feeling_(scale_1-10),monthly_expenditure_on_entertainment_(usd)
0,1,32,Other,Germany,4.35,4.08,5,TikTok,0.35,5.43,...,20,4.11,Yes,9,Tablet,50GB,Moderate,7,8,33.04
1,2,62,Other,India,4.96,4.21,2,YouTube,2.55,4.22,...,26,4.59,Yes,9,PC,10GB,Low,8,2,497.78
2,3,51,Female,USA,6.78,1.77,4,Facebook,2.09,1.09,...,47,0.66,Yes,9,Tablet,10GB,High,5,3,71.72
3,4,44,Female,India,5.06,9.21,3,YouTube,3.69,4.80,...,22,3.44,Yes,7,Tablet,10GB,Low,9,9,129.62
4,5,21,Other,Germany,2.57,1.30,4,TikTok,3.97,2.74,...,42,4.14,Yes,7,Smart TV,Unlimited,Low,5,9,35.90
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,299996,17,Female,Canada,1.98,8.41,1,Twitter,2.37,5.25,...,42,2.15,Yes,7,Tablet,10GB,Low,5,8,47.46
299996,299997,45,Female,Germany,3.78,2.18,2,YouTube,3.15,6.84,...,16,0.02,No,8,Smart TV,50GB,Low,7,8,479.79
299997,299998,45,Male,Canada,6.25,4.63,1,Facebook,4.16,2.43,...,24,1.88,Yes,6,Tablet,10GB,Moderate,1,8,53.05
299998,299999,15,Male,USA,6.47,2.34,4,YouTube,2.83,5.10,...,35,0.96,No,7,Smart TV,10GB,High,7,5,432.00


### Dataset Description: 

A brief analysis of each column. 
- `Date`: The specific date for which the stock data is recorded. 
- `Symbol`: A unique "ticker" code that identifies the company on the stock exchange. 
- `Adj_close`: The closing price of the stock after adjustments for dividends, splits, or other corporate actions. 
- `Close`: The unadjusted closing price of the stock on a given date.  
- `High`: The highest price at which the stock traded during the day.  
- `Low`: The lowest price at which the stock traded during the day. 
- `Open`: The price at which the stock started trading at the beginning of the day.
- `Volume`: The total number of shares traded during the day.
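
To show how these columns fit together, the sketch below derives a daily return per ticker from the adjusted close. The `stocks` frame is a toy example with invented tickers and prices, and the lowercase column names assume the snake_case cleaning step has been applied.

```python
import pandas as pd

# Toy price history for two made-up tickers.
stocks = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-02", "2024-01-03"]),
    "symbol": ["AAA", "AAA", "BBB", "BBB"],
    "adj_close": [100.0, 110.0, 50.0, 45.0],
})

# Sort so that pct_change compares consecutive days within each ticker.
stocks = stocks.sort_values(["symbol", "date"])
stocks["daily_return"] = stocks.groupby("symbol")["adj_close"].pct_change()
print(stocks)
```

The first row of each ticker has no previous day, so its return is `NaN`.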

<h3 style="color: #4169E1;">1.2 | Copies</h3>

In [6]:
df = data.copy()

<h2 style="color: #9370DB;"> 02 | Data Cleaning </h2>

<h3 style="color: #4169E1;"> 2.1 | Dealing with Data types</h3>

In [5]:
data.dtypes

user_id                                         int64
age                                             int64
gender                                         object
country                                        object
daily_social_media_time_(hrs)                 float64
daily_entertainment_time_(hrs)                float64
social_media_platforms_used                     int64
primary_platform                               object
daily_messaging_time_(hrs)                    float64
daily_video_content_time_(hrs)                float64
daily_gaming_time_(hrs)                       float64
occupation                                     object
marital_status                                 object
monthly_income_(usd)                          float64
device_type                                    object
internet_speed_(mbps)                         float64
subscription_platforms                          int64
average_sleep_time_(hrs)                      float64
physical_activity_time_(hrs)

<h3 style="color: #4169E1;"> 2.2 | Dealing with NaN values</h3>

In [7]:
df.isna().sum()

user_id                                       0
age                                           0
gender                                        0
country                                       0
daily_social_media_time_(hrs)                 0
daily_entertainment_time_(hrs)                0
social_media_platforms_used                   0
primary_platform                              0
daily_messaging_time_(hrs)                    0
daily_video_content_time_(hrs)                0
daily_gaming_time_(hrs)                       0
occupation                                    0
marital_status                                0
monthly_income_(usd)                          0
device_type                                   0
internet_speed_(mbps)                         0
subscription_platforms                        0
average_sleep_time_(hrs)                      0
physical_activity_time_(hrs)                  0
reading_time_(hrs)                            0
work/study_time_(hrs)                   

<h3 style="color: #4169E1;"> 2.3 | Dealing with Duplicates</h3>

In [8]:
df.duplicated().sum()

0

<h3 style="color: #4169E1;"> 2.4 | Dealing with columns </h3>

In [None]:
# Drop the raw price columns (assumes `data2` is the stocks DataFrame, which is not loaded above)
data2.drop(columns=['high', 'low', 'open', 'close'], inplace=True)

<h3 style="color: #4169E1;"> 2.5 | Moving target to the right </h3>
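
One way to move the target to the last position is to rebuild the column order. This is a minimal sketch on a toy frame: `frame` and its values are invented, and `currentprice` is the target named in the strategy.

```python
import pandas as pd

# Toy frame with the target in the first position.
frame = pd.DataFrame({
    "currentprice": [100.0, 200.0],
    "sector": ["Technology", "Energy"],
    "marketcap": [1000.0, 400.0],
})

target = "currentprice"
# Keep every other column in its original order and append the target last.
frame = frame[[col for col in frame.columns if col != target] + [target]]
print(frame.columns.tolist())
```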

<h2 style="color: #9370DB;"> 03 | EDA (Exploratory Data Analysis) </h2>

<h3 style="color: #4169E1;">3.1 | Descriptive Statistics </h3>
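
A possible starting point for this step, sketched on a toy frame with invented values: `describe()` covers the numeric columns and `value_counts()` covers a categorical one.

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
frame = pd.DataFrame({
    "sector": ["Technology", "Technology", "Energy", "Energy"],
    "marketcap": [1000.0, 3000.0, 400.0, 600.0],
    "currentprice": [100.0, 200.0, 50.0, 70.0],
})

# Count, mean, std, min, quartiles and max for each numeric column.
numeric_summary = frame.describe()
# Frequency of each category in a non-numeric column.
sector_counts = frame["sector"].value_counts()

print(numeric_summary)
print(sector_counts)
```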

<h3 style="color: #4169E1;"> 3.2 | Checking Distributions</h3>

<h3 style="color: #4169E1;"> 3.3 | Checking our target distribution</h3>

<h3 style="color: #4169E1;">3.4 | Checking Outliers </h3>
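
A common rule of thumb for flagging outliers is the 1.5 × IQR fence. The sketch below applies it to a toy price series. The values are invented, and the 1.5 multiplier is a convention, not a choice made elsewhere in this notebook.

```python
import pandas as pd

# Toy prices with one obviously extreme value.
prices = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0], name="currentprice")

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers)
```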

<h3 style="color: #4169E1;">3.5 | Looking for Correlations </h3>
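
Correlations with the target can be read from the Pearson correlation matrix. The sketch below uses a toy frame with invented values. `numeric_only=True` skips the text column and requires pandas 1.5 or later.

```python
import pandas as pd

# Toy frame: marketcap rises with price, weight falls with it.
frame = pd.DataFrame({
    "marketcap": [1.0, 2.0, 3.0, 4.0],
    "weight": [4.0, 3.0, 2.0, 1.0],
    "sector": ["Technology", "Energy", "Energy", "Technology"],
    "currentprice": [10.0, 20.0, 30.0, 40.0],
})

# Pairwise Pearson correlation between numeric columns only.
corr = frame.corr(numeric_only=True)
# Correlation of every feature with the target, strongest positive first.
target_corr = corr["currentprice"].drop("currentprice").sort_values(ascending=False)
print(target_corr)
```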

<h2 style="color: #9370DB;"> 04 | Data Processing (Optional) </h2>

<h3 style="color: #4169E1;"> 4.1 | X-Y Split</h3>
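
The split separates the features `X` from the target `y`. This is a minimal sketch on a toy frame with invented values, using `currentprice` as the target named in the strategy.

```python
import pandas as pd

# Toy frame with the target as the last column.
frame = pd.DataFrame({
    "marketcap": [1000.0, 3000.0, 400.0],
    "weight": [0.2, 0.6, 0.1],
    "currentprice": [100.0, 200.0, 50.0],
})

target = "currentprice"
X = frame.drop(columns=[target])  # feature matrix: every column except the target
y = frame[target]                 # target vector

print(X.shape, y.shape)
```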

<h3 style="color: #4169E1;"> 4.2 | Selecting the Model</h3>