<a href="https://colab.research.google.com/github/mohankishoregorle/statistical-analysis/blob/main/Mohan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

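***Descriptive Statistics***:

Mean: The average value of each feature in the dataset, including mean radius.

Median: The middle value of each feature when its values are sorted. Unlike the mean, it is not pulled by extreme values, so comparing the two hints at skew.

Mode: The most frequent value of each feature. The code keeps only the first mode (`df.mode().iloc[0]`); for continuous measurements, where few values repeat exactly, it is less informative than the mean or median.

Standard Deviation: The typical distance of a feature's values from its mean, in the feature's own units. pandas computes the sample version (dividing by n - 1).

Variance: The square of the standard deviation, so it measures the same spread in squared units.

Range: The difference between the maximum and minimum values of each feature.

Skewness: The asymmetry of a feature's distribution. Positive values indicate a longer right tail, negative values a longer left tail.

Kurtosis: The "tailedness" of a feature's distribution. pandas reports excess kurtosis, so a normal distribution scores 0.
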
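
The definitions above can be checked by hand on a small sample. The sketch below uses eight hypothetical values (not taken from the dataset) chosen so the arithmetic is easy to follow, and it contrasts the pandas default (divide by n - 1) with the NumPy default (divide by n).

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only so the sums are easy to verify by hand.
# Mean is 5.0 and the squared deviations add up to 32.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = values.var()                    # pandas default ddof=1: 32 / 7
sample_std = values.std()                    # square root of the sample variance
population_std = np.std(values.to_numpy())   # NumPy default ddof=0: sqrt(32 / 8)
value_range = values.max() - values.min()    # 9.0 - 2.0
skewness = values.skew()                     # positive here: longer right tail
excess_kurtosis = values.kurt()              # Fisher definition: normal scores 0

print(f"sample variance:  {sample_var:.4f}")
print(f"sample std:       {sample_std:.4f}")
print(f"population std:   {population_std:.4f}")
print(f"range:            {value_range:.1f}")
print(f"skewness:         {skewness:.4f}")
print(f"excess kurtosis:  {excess_kurtosis:.4f}")
```

The two standard deviations differ only in the divisor, so the gap between them shrinks as the sample grows.
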
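
***One-Sample T-Test***:

Purpose: Assesses whether the average mean radius of the tumors in the sample differs significantly from a hypothetical population mean (14.0 in this case).

T-Statistic and P-Value: The t-statistic measures how far the sample mean lies from 14.0 in units of its standard error. The p-value is the probability of a t-statistic at least that extreme if the true mean were 14.0. If the p-value is below a chosen significance level (e.g., 0.05), the sample mean is significantly different from 14.0. `stats.ttest_1samp` runs a two-sided test by default.
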
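
To make the t-statistic concrete, the following sketch recomputes it by hand and compares the result with `stats.ttest_1samp`. The sample is synthetic (drawn from a normal distribution with assumed parameters), standing in for `df['mean radius']`.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the mean radius column; loc, scale, and size are assumptions
rng = np.random.default_rng(0)
sample = rng.normal(loc=14.1, scale=3.5, size=200)
population_mean = 14.0

n = sample.size
sample_mean = sample.mean()
standard_error = sample.std(ddof=1) / np.sqrt(n)   # sample std divided by sqrt(n)

# t = (sample mean - hypothesized mean) / standard error
t_manual = (sample_mean - population_mean) / standard_error
# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, population_mean)

print(f"manual: t = {t_manual:.6f}, p = {p_manual:.6f}")
print(f"scipy:  t = {t_scipy:.6f}, p = {p_scipy:.6f}")
```

Both lines should agree up to floating-point error, which confirms that the test is a comparison of the sample mean against 14.0 scaled by the standard error.
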
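
***Confidence Interval***:

95% Confidence Interval: A range of plausible values for the true mean of mean radius, built so that intervals computed this way cover the true mean in 95% of repeated samples. A narrow interval indicates a precise estimate of the mean. The code builds it from the normal distribution (`stats.norm.interval`) rather than the t-distribution used by the t-test; with the 569 samples in this dataset the two are nearly identical.
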
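
The choice between a normal-based and a t-based interval matters mostly for small samples. As a rough illustration, the sketch below computes both for a synthetic sample of 30 values (the distribution parameters and sample size are assumptions, not dataset values).

```python
import numpy as np
from scipy import stats

# Synthetic small sample; loc, scale, and size are assumptions for illustration
rng = np.random.default_rng(1)
sample = rng.normal(loc=14.0, scale=3.5, size=30)

sample_mean = sample.mean()
standard_error = stats.sem(sample)   # sample std (ddof=1) divided by sqrt(n)

# Normal-based interval, as in the notebook code
normal_ci = stats.norm.interval(0.95, loc=sample_mean, scale=standard_error)

# t-based interval with n - 1 degrees of freedom, consistent with the t-test
t_ci = stats.t.interval(0.95, df=sample.size - 1, loc=sample_mean, scale=standard_error)

print(f"normal-based 95% CI: ({normal_ci[0]:.3f}, {normal_ci[1]:.3f})")
print(f"t-based 95% CI:      ({t_ci[0]:.3f}, {t_ci[1]:.3f})")
```

The t-based interval is always slightly wider because the t-distribution has heavier tails; the difference fades as the sample size grows.
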
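
***Linear Regression***:

Model Fitting: An Ordinary Least Squares (OLS) regression model is fitted to relate mean radius to the target variable. In scikit-learn's copy of this dataset the target is coded 0 for malignant and 1 for benign, so the model predicts the benign label rather than malignancy directly. Because the target is binary, the OLS fit acts as a linear probability model.

Model Summary: Displays the regression results, including the coefficients, the R-squared value, and the statistical significance of the predictors. The p-value on the mean radius coefficient indicates whether mean radius is a significant predictor of the target.
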

***Interpretation***:

The descriptive statistics summarize the center, spread, and shape of mean radius and the other features in the dataset.

The one-sample t-test assesses whether the average mean radius differs significantly from a predefined value (14.0 here), which helps characterize the typical tumor size.

The confidence interval gives a range for the average mean radius. It reflects the precision of that estimate, not the variability of individual tumors, which the standard deviation describes.

The linear regression model examines whether mean radius is a useful predictor of the target. A significant coefficient indicates that changes in mean radius are associated with changes in the predicted diagnosis; its sign must be read against the target coding (0 = malignant, 1 = benign).

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from scipy import stats
import numpy as np
import statsmodels.api as sm

# Load the Breast Cancer dataset
cancer = load_breast_cancer()
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# Display the first few rows
print(df.head())

# Calculate basic descriptive statistics
print("Mean:\n", df.mean())
print("\nMedian:\n", df.median())
print("\nMode:\n", df.mode().iloc[0])
print("\nStandard Deviation:\n", df.std())
print("\nVariance:\n", df.var())

# Additional descriptive statistics
print("\nRange:\n", df.max() - df.min())
print("\nSkewness:\n", df.skew())
print("\nKurtosis:\n", df.kurt())

# Example data: Mean Radius values
mean_radius_values = df['mean radius']

# Hypothetical population mean for Mean Radius
population_mean = 14.0

# Perform one-sample t-test
t_stat, p_value = stats.ttest_1samp(mean_radius_values, population_mean)

print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")

# Sample mean and standard error for Mean Radius
sample_mean = np.mean(mean_radius_values)
standard_error = stats.sem(mean_radius_values)

# Compute 95% confidence interval for Mean Radius
confidence_interval = stats.norm.interval(0.95, loc=sample_mean, scale=standard_error)

print(f"95% Confidence Interval for Mean Radius: {confidence_interval}")

# Define independent variable (add constant for intercept)
X = sm.add_constant(df['mean radius'])

# Define dependent variable (target variable)
y = df['target']

# Fit linear regression model
model = sm.OLS(y, X).fit()

# Print model summary
print(model.summary())
```

```
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840
1        20.57         17.77          132.90     1326.0          0.08474
2        19.69         21.25          130.00     1203.0          0.10960
3        11.42         20.38           77.58      386.1          0.14250
4        20.29         14.34          135.10     1297.0          0.10030

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419
1           0.07864          0.0869              0.07017         0.1812
2           0.15990          0.1974              0.12790         0.2069
3           0.28390          0.2414              0.10520         0.2597
4           0.13280          0.1980              0.10430         0.1809

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0
```

**Conclusion**:

The code analyzes the mean radius feature of the Breast Cancer dataset in four steps: descriptive statistics, a one-sample t-test, a 95% confidence interval, and a linear regression on the target. Together these describe how tumor sizes are distributed and test whether mean radius is associated with the diagnosis, which clarifies its value as a predictive feature in breast cancer diagnostics.
