In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the dataset
df = pd.read_csv('data/financial_ratios_2018_from_raw.csv')

df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

## Let's make some plots by sector for each financial ratio in our dataset

First, let's define some helper functions that take a feature name and produce per-sector histograms, boxplots, and descriptive statistics.

In [None]:
def plot_histograms_and_descriptive_stats(df, feature):
    """Plot a per-sector histogram of `feature` and return per-sector descriptive statistics."""

    sectors = df['Sector'].unique()

    plt.figure(figsize=(15, 25))

    # Prepare a DataFrame to store descriptive statistics
    desc_stats = []

    for i, sector in enumerate(sectors, 1):
        sector_data = df[df['Sector'] == sector][feature].dropna()
        stats = sector_data.describe()
        desc_stats.append({
            "Sector": sector,
            "Mean": stats['mean'],
            "Median": stats['50%'],
            "Standard Deviation": stats['std'],
            "Min": stats['min'],
            "Max": stats['max']
        })

        # Two plots per row; round the row count up so odd sector counts still fit
        plt.subplot((len(sectors) + 1) // 2, 2, i)
        sns.histplot(sector_data, kde=False, bins=20, color='purple')
        plt.title(f'{feature} - {sector}')
        plt.xlabel(f'{feature}')
        plt.ylabel('Frequency')
        plt.grid(True)
        plt.axvline(x=stats['mean'], color='r', linestyle='--', label=f"Mean: {stats['mean']:.2f}")
        plt.axvline(x=stats['50%'], color='b', linestyle='-', label=f"Median: {stats['50%']:.2f}")
        plt.legend()

    plt.tight_layout()
    plt.show()

    # Convert the list of dictionaries to a DataFrame for better visualization
    desc_stats_df = pd.DataFrame(desc_stats)
    return desc_stats_df

def plot_boxplots_by_sector(df, feature):
    sectors = df['Sector'].unique()

    plt.figure(figsize=(15, 25))

    for i, sector in enumerate(sectors, 1):
        # Two plots per row; round the row count up so odd sector counts still fit
        plt.subplot((len(sectors) + 1) // 2, 2, i)
        sns.boxplot(x='Sector', y=feature, data=df[df['Sector'] == sector])
        plt.title(f'{feature} - {sector}')
        plt.xticks(rotation=90)
        plt.grid(True)

    plt.tight_layout()
    plt.show()
    
def plot_cleaned_individual_boxplots_by_sector(df, feature, threshold=1.5):
    def remove_outliers(group, threshold=1.5):
        Q1 = group[feature].quantile(0.25)
        Q3 = group[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - threshold * IQR
        upper_bound = Q3 + threshold * IQR
        return group[(group[feature] >= lower_bound) & (group[feature] <= upper_bound)]

    sectors = df['Sector'].unique()

    plt.figure(figsize=(15, 25))

    for i, sector in enumerate(sectors, 1):
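        # Pass the caller's threshold through; otherwise the inner default of 1.5 is always used
        cleaned_data = remove_outliers(df[df['Sector'] == sector], threshold)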
        sector_data = cleaned_data[feature].dropna()
        # Two plots per row; round the row count up so odd sector counts still fit
        plt.subplot((len(sectors) + 1) // 2, 2, i)
        plt.boxplot(sector_data)
        plt.title(f'{feature} - {sector}')
        plt.xlabel(f'{sector}')
        plt.ylabel(f'{feature}')
        plt.grid(True)

    plt.tight_layout()
    plt.show()

### 1 Price to Book Value Ratio

The Price to Book Value Ratio (P/B ratio) is calculated using the formula:

$$\text{Price to Book Value Ratio} = \frac{\text{Market Price Per Share}}{\text{Book Value Per Share}}$$

This ratio measures the market's valuation of a company relative to its book value. A higher ratio suggests that the market values the company more than its accounting value, often due to expected growth or profitability. Conversely, a lower ratio may indicate undervaluation or intrinsic issues within the company.
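
As a quick illustration of the formula, here is a minimal worked example. All of the figures below are hypothetical and are not taken from the dataset; book value per share is assumed to be shareholders' equity divided by shares outstanding.

```python
# Hypothetical figures for illustration only; not taken from the dataset.
market_price_per_share = 50.0   # current share price
total_equity = 2_000_000.0      # shareholders' equity (book value)
shares_outstanding = 100_000

# Book value per share = equity / shares outstanding (assumed definition)
book_value_per_share = total_equity / shares_outstanding

# P/B ratio = market price per share / book value per share
pb_ratio = market_price_per_share / book_value_per_share

print(f"Book value per share: {book_value_per_share:.2f}")
print(f"P/B ratio: {pb_ratio:.2f}")
```

With these numbers the market prices the company at 2.5 times its accounting value.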

Let's analyze the descriptive statistics for the Price to Book Value Ratio by sector and check for any possible anomalies that might indicate data issues. I'll compute the mean, median, standard deviation, minimum, and maximum for each sector, alongside the histograms.

In [None]:
desc_stats_df = plot_histograms_and_descriptive_stats(df, 'priceBookValueRatio')

In [None]:
desc_stats_df

### Observations:
Outliers and Anomalies: Some sectors like Financial Services and Healthcare show extremely high means and standard deviations due to significant outliers. This can indicate potential data errors or uniquely high ratios that need further investigation.

Skewness: Most sectors show a positive skew, with medians generally lower than the means, indicating that higher values (outliers) are pulling the mean upwards.

Sector Variance: The variance across sectors in terms of spread and central tendencies highlights different valuation practices and market perceptions.
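
The skewness observation can also be checked numerically rather than read off the histograms. The sketch below uses a small made-up DataFrame (`toy`) with the same `Sector` and `priceBookValueRatio` column names as our dataset; the values are invented for illustration. A positive gap between mean and median, together with a positive skew statistic, points to a right-skewed sector.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Sector": ["Technology"] * 5 + ["Utilities"] * 5,
    "priceBookValueRatio": [2.0, 3.0, 4.0, 5.0, 60.0, 1.0, 1.5, 2.0, 2.5, 3.0],
})

# Per-sector mean, median, and sample skewness
summary = toy.groupby("Sector")["priceBookValueRatio"].agg(["mean", "median", "skew"])

# A positive gap means large values are pulling the mean above the median
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```

Replacing `toy` with `df` should give the same summary for the real data.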

### Analysis Implications:
Data Validation: Extreme values, especially in sectors like Financial Services and Healthcare, should be verified for accuracy. These might be errors or special cases like companies with very small book values and relatively high stock prices.

Investment Insight: Understanding typical and outlier values in each sector can help identify investment opportunities and risks, emphasizing the need for sector-specific valuation norms.

This sector-specific breakdown with descriptive statistics provides a deeper insight into the financial characteristics and potential anomalies in the dataset, useful for refined analysis and strategic decision-making.
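
One possible way to start the data validation step is to pull out the rows that look suspicious so they can be reviewed by hand. The sketch below is only an illustration on made-up data: the `toy` DataFrame, its values, and the cutoff of 50 times the sector median are all assumptions, not established rules.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Sector": ["Financial Services"] * 4 + ["Healthcare"] * 4,
    "priceBookValueRatio": [1.1, 0.9, 1.3, 5000.0, 3.2, 2.8, 4.1, -12.0],
})

# Median ratio of each row's own sector, aligned to the original rows
sector_median = toy.groupby("Sector")["priceBookValueRatio"].transform("median")

# Flag negative ratios (negative book equity) and ratios far above the sector median.
# The factor of 50 is an arbitrary illustrative cutoff, not a recommended standard.
is_negative = toy["priceBookValueRatio"] < 0
is_extreme = toy["priceBookValueRatio"] > 50 * sector_median
suspicious = toy[is_negative | is_extreme]
print(suspicious)
```

The flagged rows are candidates for manual checking, not automatic deletion, since a very small book value can legitimately produce a very large ratio.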

### We should create boxplots now to see the outliers.

In [None]:
plot_boxplots_by_sector(df,'priceBookValueRatio')

### Observations:
Extreme Outliers: Some sectors, such as Financial Services and Healthcare, exhibit extreme outliers. For example, the Financial Services sector has a ratio that goes up to $10^8$, which is unusually high and likely indicates erroneous data.

Variability Across Sectors: The variability of the price to book value ratio differs significantly among sectors. Sectors like Basic Materials and Utilities show a relatively tighter range, whereas sectors like Technology and Industrials have more spread and visible outliers.

Skewed Distributions: Many sectors show right-skewed distributions, indicating that most companies have lower ratios with a few having very high values. This skewness is especially pronounced in sectors like Technology and Healthcare.

Central Tendencies: The central box (interquartile range) for most sectors is quite compressed, suggesting that the majority of the data points are closely packed, but outliers are distorting the visual representation.

### Analysis:
Data Quality Concerns: The presence of extreme outliers, particularly in sectors like Financial Services and Healthcare, suggests that there may be data quality issues. These could be due to incorrect data entries or specific cases where the book value is very low, causing the ratio to spike.

Sector-Specific Characteristics: The differences in variability and central tendency across sectors highlight the importance of context when interpreting financial ratios. For instance, the Technology sector might naturally have a wider range of ratios due to varying growth expectations.

Need for Data Cleaning: To get a more accurate picture, it might be useful to apply outlier detection and removal techniques. This could involve setting a threshold based on the interquartile range (IQR) to filter out extreme values.

### Recommendations:
Outlier Removal: Implement a function to remove outliers based on the IQR method. This can help in getting a clearer view of the typical values within each sector.

Sector-Specific Analysis: Conduct deeper dives into each sector separately, considering the context and characteristics unique to that sector.

Data Validation: Review and validate the data entries, especially for sectors with extreme outliers, to ensure accuracy.
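
The IQR-based outlier removal recommended above can be sketched as a standalone filter that returns the cleaned rows instead of plotting them. This is a minimal sketch, not the only way to do it: `iqr_filter` is a hypothetical helper name, the toy data is made up, and the quartiles are computed per sector on the data that still contains the outliers.

```python
import pandas as pd

def iqr_filter(df, feature, group_col="Sector", threshold=1.5):
    """Keep rows whose `feature` lies within [Q1 - k*IQR, Q3 + k*IQR] of their group."""
    grouped = df.groupby(group_col)[feature]
    # transform() broadcasts each group's quartile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    # Rows with a missing feature value compare as False and are dropped
    within = df[feature].between(q1 - threshold * iqr, q3 + threshold * iqr)
    return df[within]

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Sector": ["A"] * 5 + ["B"] * 5,
    "priceBookValueRatio": [1.0, 2.0, 3.0, 4.0, 100.0, 10.0, 11.0, 12.0, 13.0, 14.0],
})

cleaned = iqr_filter(toy, "priceBookValueRatio")
print(cleaned)
```

In sector A the bounds work out to [-1, 7], so the value 100 is removed, while every row in sector B is kept. A larger `threshold` keeps more of the tail.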


In [None]:

# Example usage:
plot_cleaned_individual_boxplots_by_sector(df, 'priceBookValueRatio')
