
**Descriptive Statistics**

**Basic Statistics:**

**Mean, Median, Mode**: These measures give an overview of the central tendency of the dataset's features.

**Standard Deviation and Variance:** These indicate the spread or dispersion of the data around the mean.
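
As a small illustration of these measures, the sketch below computes them on a made-up series (`values` is hypothetical and not taken from the Wine dataset). It also highlights a detail that is easy to miss: pandas' `var()` and `std()` default to the sample estimators (`ddof=1`), while NumPy's `np.var` and `np.std` default to the population versions (`ddof=0`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen only for illustration
values = pd.Series([2.0, 3.0, 3.0, 5.0, 7.0])

mean = values.mean()            # arithmetic average
median = values.median()        # middle value of the sorted data
modes = values.mode().tolist()  # mode() returns every most-frequent value

sample_var = values.var()                   # pandas default: ddof=1
population_var = np.var(values.to_numpy())  # NumPy default: ddof=0

print(f"Mean: {mean}, Median: {median}, Mode(s): {modes}")
print(f"Sample variance: {sample_var}, Population variance: {population_var}")
```

Because `mode()` can return several values, taking only the first row (as `df.mode().iloc[0]` does) reports just one of them when a column has ties.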

**Additional Statistics:**

**Range:** The difference between the maximum and minimum values in the dataset, giving a sense of the data's spread.

**Skewness**: Measures the asymmetry of the data distribution.

**Kurtosis:** Indicates the "tailedness" of the data distribution.
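
The following sketch shows these three statistics on a hypothetical right-skewed sample (the numbers are invented for illustration). It assumes the pandas defaults: `skew()` and `kurt()` return bias-corrected estimates, and `kurt()` reports excess kurtosis, so a normal distribution scores about 0. The SciPy calls are included only as a cross-check.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed sample: most values are small, one is large
sample = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 15.0])

data_range = sample.max() - sample.min()
skewness = sample.skew()  # positive here because of the long right tail
kurtosis = sample.kurt()  # excess kurtosis (a normal distribution gives about 0)

# SciPy reproduces pandas' values when its bias correction is switched on
scipy_skew = stats.skew(sample.to_numpy(), bias=False)
scipy_kurt = stats.kurtosis(sample.to_numpy(), fisher=True, bias=False)

print(f"Range: {data_range}")
print(f"Skewness: {skewness:.3f} (SciPy: {scipy_skew:.3f})")
print(f"Excess kurtosis: {kurtosis:.3f} (SciPy: {scipy_kurt:.3f})")
```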

**One-Sample T-Test:**

The one-sample t-test compares the mean alcohol content of the dataset to a hypothetical population mean of 12.5.
The resulting t-statistic and p-value indicate whether the sample mean differs significantly from this hypothesized value.
The 95% confidence interval gives a range of plausible values for the true mean alcohol content: intervals constructed this way contain the true mean in 95% of repeated samples.
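
To make the test concrete, the t-statistic is t = (x̄ − μ₀) / (s / √n), where x̄ is the sample mean, μ₀ the hypothesized mean, s the sample standard deviation, and n the sample size, with n − 1 degrees of freedom. The sketch below computes it by hand on hypothetical readings (`readings` and `mu_0` are invented for illustration) and checks the result against `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements and null-hypothesis mean (illustration only)
readings = np.array([12.9, 13.4, 12.6, 13.1, 13.8, 12.7, 13.3, 13.0])
mu_0 = 12.5

n = readings.size
sample_mean = readings.mean()
standard_error = readings.std(ddof=1) / np.sqrt(n)  # s / sqrt(n)

# t-statistic and two-sided p-value computed from the definition
t_manual = (sample_mean - mu_0) / standard_error
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# SciPy's built-in test should agree with the manual calculation
t_scipy, p_scipy = stats.ttest_1samp(readings, mu_0)

# 95% confidence interval from the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low = sample_mean - t_crit * standard_error
ci_high = sample_mean + t_crit * standard_error

print(f"t = {t_manual:.4f} (SciPy: {t_scipy:.4f})")
print(f"p = {p_manual:.4f} (SciPy: {p_scipy:.4f})")
print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```

For a sample as large as the 178-row Wine dataset, an interval based on the t distribution and one based on the normal distribution are nearly identical; the difference matters mainly for small samples like the one above.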

**Logistic Regression for Binary Classification**

**Binary Target Creation:**

The target variable was converted into a binary format (target 0 vs. others), allowing us to apply binary logistic regression.

**Statsmodels Logistic Regression:**

A logistic regression model was fitted using the statsmodels library.
The model summary provides details such as coefficients, standard errors, z-values, and p-values, helping assess the impact of alcohol content on the binary target.
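
When reading the summary, it helps to remember that logistic regression coefficients are on the log-odds scale, so exponentiating a coefficient gives an odds ratio. The sketch below demonstrates this with hypothetical coefficients (`intercept` and `slope` are invented and are not the values fitted in this notebook).

```python
import numpy as np

# Hypothetical coefficients, NOT the values fitted in this notebook
intercept = -30.0
slope = 2.3  # change in log-odds per one-unit increase in the predictor


def predicted_probability(x):
    """Logistic model: P(y = 1 | x) = 1 / (1 + exp(-(intercept + slope * x)))."""
    return 1.0 / (1.0 + np.exp(-(intercept + slope * x)))


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


p_at_13 = predicted_probability(13.0)
p_at_14 = predicted_probability(14.0)

# The ratio of odds for a one-unit increase equals exp(slope)
odds_ratio = odds(p_at_14) / odds(p_at_13)

print(f"P(y=1 | x=13) = {p_at_13:.3f}")
print(f"P(y=1 | x=14) = {p_at_14:.3f}")
print(f"Odds ratio: {odds_ratio:.3f}, exp(slope): {np.exp(slope):.3f}")
```

With a fitted statsmodels result, the same quantity should be obtainable by applying `np.exp` to the result's `params` attribute.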

**Scikit-learn Logistic Regression:**

A logistic regression model was also fitted using scikit-learn for comparison.
The model was evaluated using accuracy, which measures the proportion of correctly classified instances in the test set.
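
Accuracy is easier to interpret next to a baseline, because the binary target is imbalanced (class 0 is the minority). The sketch below is one possible way to add that context, assuming the same feature, split, and random seed as the notebook; the `DummyClassifier` baseline and the confusion matrix are additions for illustration and not part of the original analysis.

```python
from sklearn.datasets import load_wine
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Same setup as the notebook: alcohol as the only feature, class 0 vs. the rest
wine = load_wine(as_frame=True)
X = wine.data[["alcohol"]]
y = (wine.target == 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline that always predicts the most frequent class in the training set
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
model_accuracy = accuracy_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

print(f"Baseline accuracy: {baseline_accuracy:.2%}")
print(f"Model accuracy:    {model_accuracy:.2%}")
print(cm)
```

A model is only informative to the extent that its accuracy exceeds the baseline; the confusion matrix then shows whether its errors are mostly false positives or false negatives.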

In [1]:
import pandas as pd
from sklearn.datasets import load_wine
from scipy import stats
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
df['target'] = wine.target

# Display the first few rows
print(df.head())

# Calculate basic descriptive statistics
print("Mean:\n", df.mean())
print("\nMedian:\n", df.median())
print("\nMode:\n", df.mode().iloc[0])
print("\nStandard Deviation:\n", df.std())
print("\nVariance:\n", df.var())

# Additional descriptive statistics
print("\nRange:\n", df.max() - df.min())
print("\nSkewness:\n", df.skew())
print("\nKurtosis:\n", df.kurt())

# Example data: Alcohol content values
alcohol_values = df['alcohol']

# Hypothetical population mean for Alcohol
population_mean = 12.5

# Perform one-sample t-test
t_stat, p_value = stats.ttest_1samp(alcohol_values, population_mean)

print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")

# Sample mean and standard error for Alcohol
sample_mean = np.mean(alcohol_values)
standard_error = stats.sem(alcohol_values)

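# Compute 95% confidence interval for Alcohol using the t distribution,
# consistent with the one-sample t-test above (n - 1 degrees of freedom)
confidence_interval = stats.t.interval(0.95, df=len(alcohol_values) - 1, loc=sample_mean, scale=standard_error)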

print(f"95% Confidence Interval for Alcohol: {confidence_interval}")

# For classification, we'll focus on binary classification (target 0 vs others)
df['binary_target'] = (df['target'] == 0).astype(int)

# Define independent variable (add constant for intercept)
X = sm.add_constant(df['alcohol'])

# Define dependent variable (binary target)
y = df['binary_target']

# Fit logistic regression model
model = sm.Logit(y, X).fit()

# Print model summary
print(model.summary())

# Logistic regression using scikit-learn for comparison
X_train, X_test, y_train, y_test = train_test_split(df[['alcohol']], df['binary_target'], test_size=0.2, random_state=42)
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = log_reg.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Logistic Regression model: {accuracy * 100:.2f}%")

   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   
3    14.37        1.95  2.50               16.8      113.0           3.85   
4    13.24        2.59  2.87               21.0      118.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   
3        3.49                  0.24             2.18             7.80  0.86   
4        2.69                  0.39             1.82             4.32  1.04   

   od280/od315_of_diluted_wines  proline  target  
0          

**CONCLUSION:**

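In summary, the notebook explores the Wine dataset with descriptive statistics, uses a one-sample t-test to check whether the mean alcohol content differs from 12.5, and models the binary target (class 0 vs. the rest) from alcohol content with logistic regression in both statsmodels and scikit-learn. The two approaches are complementary: the statsmodels fit supports inference on the alcohol coefficient, while the scikit-learn fit measures predictive accuracy on held-out data.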




