In [None]:

# Trading Features Dataset Cleaning and Analysis

The "Trading-features.csv" dataset contains 10,000 rows and 164 columns of stock trading data:
id, last_price, created_at, and 161 numbered feature columns (f1 to f161). This notebook loads
the dataset, checks it for missing values, duplicates, and outliers, prints summary statistics,
and saves the result to a new CSV file.

## Approach:

### Load the Dataset:
The script starts by loading the trading features dataset from the CSV file using the pd.read_csv function.
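
As a small illustration of the loading step, the sketch below reads an in-memory CSV that mimics the first few columns of the dataset. The sample values are made up, and treating created_at as a timestamp is an assumption: the notebook loads one column as a plain object column, presumably this one.

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics the first columns of the dataset
sample_csv = io.StringIO(
    "id,last_price,created_at,f1\n"
    "1,42659.5,2024-01-01 00:00:00,0.000397\n"
    "2,42747.4,2024-01-01 00:00:01,0.000400\n"
    "3,42835.9,2024-01-01 00:00:02,0.000408\n"
)

# parse_dates converts the named column to a datetime dtype while reading
sample = pd.read_csv(sample_csv, parse_dates=["created_at"])
print(sample.dtypes)
```

If created_at does hold timestamps, parsing it at load time makes time-based sorting and resampling possible without a separate conversion step.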

### Basic Information and Missing Values:
Basic information about the dataset is displayed using the df.info() method. With 164 columns, pandas prints a condensed report: the index range, the column range, a count of columns per data type, and memory usage.
The script then counts missing values in each column using df.isnull().sum() and prints the results.
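
With this many columns, the printed result of df.isnull().sum() is elided in the middle, so a non-zero count could be hidden. One possible way around this, sketched on a made-up toy frame, is to print only the columns that actually contain gaps, plus a grand total.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only column f2 has gaps
toy = pd.DataFrame({
    "last_price": [42659.5, 42747.4, 42835.9],
    "f1": [0.000397, 0.000400, 0.000408],
    "f2": [15139.0, np.nan, np.nan],
})

missing = toy.isnull().sum()

# Keep only the columns that actually contain missing values
print(missing[missing > 0])

# Total number of missing cells across the whole frame
total_missing = int(missing.sum())
print("Total missing cells:", total_missing)
```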

### Handling Missing Values:
The script shows two common approaches to handling missing values as commented-out examples:
- Drop rows with missing values: df = df.dropna()
- Fill missing values with a specific value: df = df.fillna(value)
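
The two approaches can be compared on a small made-up frame. Filling with the column median is only one possible choice of fill value and is an assumption here, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each column
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, 4.0],
    "f2": [10.0, 20.0, np.nan, 40.0],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill each column's gaps with that column's median
filled = toy.fillna(toy.median(numeric_only=True))

print(dropped)
print(filled)
```

Dropping keeps only fully observed rows (two of four here), while filling keeps every row at the cost of introducing estimated values.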

### Check for Duplicates:
The script checks for duplicate rows using df.duplicated().sum() and prints the number of duplicate rows.
If necessary, duplicate rows can be dropped using df = df.drop_duplicates().
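
A minimal sketch of the duplicate check on a made-up frame, where the second and third rows are identical:

```python
import pandas as pd

# Hypothetical toy frame; rows 1 and 2 are exact copies of each other
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "last_price": [42659.5, 42747.4, 42747.4, 42835.9],
})

# duplicated() marks a row True when an identical row appeared earlier
n_duplicates = int(toy.duplicated().sum())
print("Duplicate rows:", n_duplicates)

# keep="first" (the default) retains the first occurrence of each duplicate group
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
print(deduped)
```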

### Identify and Handle Outliers:
Numerical columns are identified using df.select_dtypes(include=['float64', 'int64']).columns.
Box plots are created for each numerical column using sns.boxplot. This helps identify outliers and visualize the distribution of values.
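
Box plots in matplotlib and seaborn draw points as outliers, by default, when they lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that same rule numerically to a made-up frame, which may be easier to scan than 160-odd plots. The 1.5 multiplier and the toy data are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame; f1 contains one extreme value
toy = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 100.0],
    "f2": [10.0, 11.0, 12.0, 13.0, 14.0],
})

numeric_cols = toy.select_dtypes(include="number").columns

# Quartiles and interquartile range per column
q1 = toy[numeric_cols].quantile(0.25)
q3 = toy[numeric_cols].quantile(0.75)
iqr = q3 - q1

# A value is flagged when it lies more than 1.5 * IQR outside the quartiles
is_outlier = (toy[numeric_cols] < q1 - 1.5 * iqr) | (toy[numeric_cols] > q3 + 1.5 * iqr)
outlier_counts = is_outlier.sum()
print(outlier_counts)
```

Whether flagged values should be removed depends on the data: in price series, extreme values can be genuine market moves rather than errors.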

### Print Dataset Insights:
The script prints the number of rows with print("Number of rows:", len(df)). The number of columns can be printed the same way with print("Number of columns:", len(df.columns)).

### Summary Statistics:
The script prints summary statistics of the dataset using df.describe().
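
For a wide dataset, the output of df.describe() wraps across many column blocks. One possible alternative, shown on a made-up two-column frame, is to transpose the result so that each feature becomes a row.

```python
import pandas as pd

# Hypothetical toy frame using two of the dataset's column names
toy = pd.DataFrame({
    "last_price": [42659.5, 42747.4, 42835.9, 43262.0],
    "f3": [-57.8, -46.0, -38.2, -22.1],
})

# Transpose so each feature is a row and each statistic is a column
stats = toy.describe().T
print(stats[["mean", "std", "min", "max"]])
```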

### Save the Cleaned Dataset:
The cleaned dataset is saved to a new CSV file named "Cleaned-Trading-features.csv" using df.to_csv.

### Display Cleaned Dataset Information:
The script displays basic information about the cleaned dataset using df.info() after the cleaning operations.

### Usage:
1. Ensure the "Trading-features.csv" file is in the notebook's working directory, or update file_path to point to it.
2. Run the cells in order to load, check, analyze, and save the dataset.
3. Review the printed information and summary statistics to gain insights into the dataset.

In [40]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [41]:
# Additional settings for Jupyter to display plots inline
%matplotlib inline

# (Optional) Customize default Seaborn style
sns.set(style="whitegrid")

In [12]:
# Load the dataset
file_path = "Trading-features.csv"
df = pd.read_csv(file_path)

# Display basic information about the dataset
# (df.info() prints its report itself and returns None, hence the trailing "None" in the output)
print(df.info())

# Check for missing values
print("Missing values:\n", df.isnull().sum())
# Handle missing values
# Example: If there are missing values in a column, you can choose to drop or fill them
# df = df.dropna()  # Drop rows with missing values
# df = df.fillna(value)  # Fill missing values with a specific value

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Columns: 164 entries, id to f161
dtypes: float64(136), int64(27), object(1)
memory usage: 12.5+ MB
None
Missing values:
 id            0
last_price    0
created_at    0
f1            0
f2            0
             ..
f157          0
f158          0
f159          0
f160          0
f161          0
Length: 164, dtype: int64


In [42]:
# Check for duplicates
print("Duplicate rows:", df.duplicated().sum())

# Drop duplicates if necessary
# df = df.drop_duplicates()

Duplicate rows: 0


In [43]:
print("Number of rows:", len(df))

Number of rows: 10000


In [44]:
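# No outlier-removal step is applied above, so df is unchanged at this point.
# Record the row count before any filtering so the comparison below is meaningful.
rows_before = len(df)
# (apply an outlier filter to df here, if one is needed)

# Display the shape of the DataFrame after the (optional) outlier step
print("DataFrame shape after removing outliers:", df.shape)

# Check whether any rows were removed; the else branch only means no rows were dropped
if df.shape[0] < rows_before:
    print("Outliers were present and have been removed.")
else:
    print("No outliers found.")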

DataFrame shape after removing outliers: (10000, 164)
No outliers found.


In [None]:
# Perform other cleaning operations as needed

In [45]:
# Additional insights or summary statistics
print("\nSummary statistics:")
print(df.describe())

# Save the dataset
cleaned_file_path = "Cleaned-Trading-features.csv"
df.to_csv(cleaned_file_path, index=False)

# Display basic information about the cleaned dataset
# df.info() prints its report directly and returns None, so call it on its own line
print("\nCleaned dataset info:")
df.info()



Summary statistics:
                 id    last_price            f1           f2            f3  \
count  1.000000e+04  10000.000000  10000.000000  10000.00000  10000.000000   
mean   5.546190e+07  42938.580560      0.000414  20149.80990    -40.236010   
std    2.890879e+03    205.486455      0.000018   2897.95985      7.212493   
min    5.545690e+07  42659.500000      0.000397  15139.00000    -57.800000   
25%    5.545940e+07  42747.400000      0.000400  17642.00000    -46.000000   
50%    5.546190e+07  42835.850000      0.000408  20146.00000    -38.200000   
75%    5.546441e+07  43145.225000      0.000417  22652.25000    -34.500000   
max    5.546691e+07  43262.000000      0.000463  25196.00000    -22.100000   

                 f4            f5            f6            f7           f8  \
count  10000.000000  10000.000000  10000.000000  10000.000000  10000.00000   
mean      42.252370      2.016360      0.523120      1.103560      2.29049   
std        9.346334      5.339766     12.8