heytanix/MSML_Repository

Mathematics & Statistics for Machine Learning - (Experiential Learning) - TEAM 3

This repository contains a statistical analysis of the relationship between students' study hours and their academic performance (marks). The analysis follows a structured approach addressing 15 key statistical concepts and techniques.

We strongly recommend cloning the repository using

git clone https://github.com/heytanix/MSML_Repository.git

Use conv_app.py for reference only; it is not intended to be run. To run the analysis, use app.ipynb.

Team Members (Contributors)

  • Thanish Chinnappa KC
  • Likhith V
  • Sahil Patil
  • Sudareshwar S
  • Samith Shivakumar
  • Souharda Mandal

Libraries Used

The code utilizes several Python libraries for statistical analysis and data visualization:

  • NumPy: Provides support for numerical computing with arrays and mathematical functions
  • Pandas: Used for data manipulation and analysis with DataFrames
  • Matplotlib: Creates static visualizations and plots
  • Seaborn: Built on Matplotlib, provides enhanced statistical visualizations
  • SciPy: Implements various statistical functions and tests
  • Statsmodels: Offers classes and functions for statistical models and hypothesis testing
  • Scikit-learn: Provides machine learning algorithms for regression analysis and evaluation metrics

Analysis Questions

Method used in Question 1

  • Calculated the initial sample size using the formula for a known population standard deviation: $n_0 = \left( \frac{z_{\alpha/2} \cdot \sigma}{E} \right)^2$, where $z_{\alpha/2} = 1.96$ (95% confidence), $\sigma = 10$, and $E = 3$.
  • Applied the Finite Population Correction (FPC) to adjust the sample size: $n = \frac{n_0}{1 + \frac{n_0 - 1}{N}}$, where $N = 979$.
  • Rounded the results up with math.ceil() to obtain a practical sample size.
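
The calculation above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the function name `required_sample_size` is hypothetical, and it assumes that $n_0$ is rounded up before the correction is applied.

```python
import math

def required_sample_size(z, sigma, margin, population_size):
    """Return (n0, n): the initial and FPC-adjusted sample sizes, both rounded up."""
    # Initial sample size for a known population standard deviation.
    n0 = math.ceil((z * sigma / margin) ** 2)
    # Finite population correction, applied to the rounded-up n0 (an assumption).
    n = math.ceil(n0 / (1 + (n0 - 1) / population_size))
    return n0, n

n0, n = required_sample_size(z=1.96, sigma=10, margin=3, population_size=979)
print(n0, n)  # 43 42
```

With these inputs the sketch gives $n_0 = 43$ and $n = 42$, which matches the sample size of 42 used in the later questions. Applying the correction to the unrounded $n_0 \approx 42.68$ would instead give 41, so the order of rounding matters here.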

Method used in Question 2

  • Performed Simple Random Sampling using pandas.DataFrame.sample() with n=42, random_state=42, and replace=False to select a sample from the population dataset.
  • Visualized population and sample distributions for 'Marks (out of 100)' and 'Study Hours (per week)' using seaborn.histplot() with KDE.
  • Conducted statistical comparison by computing and comparing means and standard deviations of population and sample using pandas methods (mean(), std()).
  • Generated the sampling distribution of sample means by simulating 1000 samples of size 42 using pandas.sample() and plotting with seaborn.histplot().

Method used in Question 3

  • Loaded the sampled data from Team-3_Sample.csv using pandas.read_csv().
  • Calculated the sample mean and standard deviation for 'Marks (out of 100)' and 'Study Hours (per week)' using pandas methods (mean(), std()).
  • Presented results in a formatted text table.

Method used in Question 4

  • Simulated 1000 samples of size 42 from the population's 'Marks (out of 100)' using pandas.sample() with random_state=42.
  • Computed the mean and standard deviation (standard error) of the sample means using numpy.mean() and numpy.std(ddof=1).
  • Visualized the sampling distribution using seaborn.histplot() with KDE, marking population and sampling distribution means.
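
The simulation step can be sketched as follows, using only the standard library in place of pandas and NumPy. The `population` list is a hypothetical stand-in generated from a normal distribution, not the real dataset. A single random generator is advanced across draws so that each sample differs; reusing one fixed seed for every individual draw would reproduce the same sample each time.

```python
import math
import random
import statistics

rng = random.Random(42)

# Hypothetical stand-in for the 979 population marks, clipped to the 0-100 range.
population = [min(100.0, max(0.0, rng.gauss(77.38, 10))) for _ in range(979)]

n, n_samples = 42, 1000

# Draw 1000 samples of size 42 without replacement and record each sample mean.
sample_means = [statistics.mean(rng.sample(population, n)) for _ in range(n_samples)]

mean_of_means = statistics.mean(sample_means)
empirical_se = statistics.stdev(sample_means)  # sample standard deviation (ddof=1)

# Theoretical standard error, including the finite population correction
# because sampling is done without replacement from N = 979 values.
N = len(population)
theoretical_se = statistics.pstdev(population) / math.sqrt(n) * math.sqrt((N - n) / (N - 1))

print(f"mean of sample means: {mean_of_means:.2f}")
print(f"empirical SE: {empirical_se:.3f}, theoretical SE: {theoretical_se:.3f}")
```

The mean of the sample means should sit close to the population mean, and the empirical standard error should be close to the theoretical value.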

Method used in Question 5

  • Reused the sample means from Question 4.
  • Plotted the empirical sampling distribution using seaborn.histplot() with KDE.
  • Overlaid a theoretical normal distribution based on the Central Limit Theorem (CLT) using scipy.stats.norm.pdf() with the population mean and standard error ($\sigma/\sqrt{n}$).
  • Added annotations for population and sampling means using matplotlib.pyplot.axvline().

Method used in Question 6

  • Visualized the sampling distribution under the null hypothesis ($H_0: \mu = 77.38$) using scipy.stats.norm.pdf() with the population mean and standard error.
  • Shaded critical regions for a two-tailed test at $\alpha = 0.05$ using matplotlib.pyplot.fill_between() and critical z-values from scipy.stats.norm.ppf(0.975).
  • Marked the sample mean (77.14) on the plot using matplotlib.pyplot.axvline().

Method used in Question 7

  • Conducted a two-tailed z-test for the population mean ($H_0: \mu = 77.38$) using the sample mean (77.14), population standard deviation ($\sigma = 10$), and sample size ($n = 42$).
  • Calculated the z-score: $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
  • Computed the p-value using scipy.stats.norm.cdf() for a two-tailed test.
  • Compared the z-score against critical z-values for $\alpha = 0.05$ and $\alpha = 0.01$ using scipy.stats.norm.ppf().
  • Visualized the z-distribution with rejection regions using matplotlib.pyplot.plot() and fill_between().
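
The test above can be reproduced from the figures quoted in this section. The sketch below uses the standard library's `statistics.NormalDist` as a stand-in for `scipy.stats.norm` (its `cdf` and `inv_cdf` play the roles of `norm.cdf` and `norm.ppf`); the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Values quoted in this section.
x_bar, mu_0, sigma, n = 77.14, 77.38, 10, 42

standard_error = sigma / math.sqrt(n)
z = (x_bar - mu_0) / standard_error

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
p_value = 2 * (1 - std_normal.cdf(abs(z)))  # two-tailed p-value

for alpha in (0.05, 0.01):
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    decision = "reject H0" if abs(z) > z_crit else "fail to reject H0"
    print(f"alpha={alpha}: z={z:.3f}, critical=+/-{z_crit:.3f}, p={p_value:.3f} -> {decision}")
```

With these inputs the z-score is about -0.16, well inside both critical regions, so the null hypothesis is not rejected at either significance level.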

Method used in Question 8

  • Calculated confidence intervals for the population mean at 90%, 95%, and 99% confidence levels using the sample mean (77.14), population standard deviation ($\sigma = 10$), and standard error.
  • Used z-critical values from scipy.stats.norm.ppf() for each confidence level.
  • Computed interval bounds: $\bar{x} \pm z \cdot \frac{\sigma}{\sqrt{n}}$
  • Visualized the intervals using matplotlib.pyplot.plot() to show the range for each confidence level.
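
As a rough sketch of the interval computation, again with `statistics.NormalDist` standing in for `scipy.stats.norm` and with illustrative variable names:

```python
import math
from statistics import NormalDist

# Values quoted in this section.
x_bar, sigma, n = 77.14, 10, 42
standard_error = sigma / math.sqrt(n)

intervals = {}
for level in (0.90, 0.95, 0.99):
    # Two-sided critical value, e.g. the 0.975 quantile for a 95% interval.
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z_crit * standard_error
    intervals[level] = (x_bar - margin, x_bar + margin)

for level, (low, high) in intervals.items():
    print(f"{level:.0%} CI: ({low:.2f}, {high:.2f})")
```

Each interval is centred on the sample mean and widens as the confidence level rises. All three contain the hypothesised mean of 77.38, which agrees with the outcome of the z-test in Question 7.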

Method used in Question 9

  • Summarized findings from the hypothesis test (Question 7) and confidence intervals (Question 8).
  • Concluded that the sample mean is consistent with the population mean, failing to reject $H_0$, based on statistical evidence.
  • Presented results in a formatted text output.

Method used in Question 10

  • Loaded the sample data from Team-3_Sample.csv.
  • Created a scatter plot with a regression line for 'Study Hours (per week)' vs. 'Marks (out of 100)' using seaborn.regplot().
  • Added grid and labels using matplotlib.pyplot for visualization.

Method used in Question 11

  • Calculated the Pearson correlation coefficient between 'Study Hours (per week)' and 'Marks (out of 100)' using scipy.stats.pearsonr().
  • Reported the correlation coefficient in a formatted text output.

Method used in Question 12

  • Derived two regression equations:
    • Marks on Study Hours: $\text{Marks} = a + b \cdot \text{Study Hours}$
    • Study Hours on Marks: $\text{Study Hours} = a' + b' \cdot \text{Marks}$
  • Calculated regression coefficients using the Pearson correlation coefficient, sample means, and standard deviations of both variables.
  • Used formulas: $b = r \cdot \frac{s_y}{s_x}$, $a = \bar{y} - b \cdot \bar{x}$, and similarly for the second regression ($b' = r \cdot \frac{s_x}{s_y}$, $a' = \bar{x} - b' \cdot \bar{y}$).
  • Presented equations in a formatted text output.
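
The formulas above can be illustrated with a small sketch. The `hours` and `marks` lists are hypothetical values made up for the example, not rows from Team-3_Sample.csv, and `pearson_r` is an illustrative helper rather than the notebook's code.

```python
import statistics

# Hypothetical data for illustration only.
hours = [4, 6, 8, 10, 12, 14, 16]
marks = [65, 60, 74, 70, 83, 79, 90]

def pearson_r(x, y):
    """Pearson correlation from the sample covariance and sample standard deviations."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

r = pearson_r(hours, marks)
mean_x, mean_y = statistics.mean(hours), statistics.mean(marks)
s_x, s_y = statistics.stdev(hours), statistics.stdev(marks)

# Regression of Marks on Study Hours: Marks = a + b * Hours
b = r * s_y / s_x
a = mean_y - b * mean_x

# Regression of Study Hours on Marks: Hours = a' + b' * Marks
b_prime = r * s_x / s_y
a_prime = mean_x - b_prime * mean_y

print(f"Marks = {a:.2f} + {b:.2f} * Hours")
print(f"Hours = {a_prime:.2f} + {b_prime:.2f} * Marks")
```

A useful sanity check on the two equations is that the product of the slopes equals $r^2$, and that both lines pass through the point $(\bar{x}, \bar{y})$.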

Method used in Question 13

  • Used the regression equation from Question 12 ($\text{Marks} = a + b \cdot \text{Study Hours}$) to predict marks for a given study hours value (12 hours).
  • Computed the predicted value and displayed it in a formatted text output.

Method used in Question 14

  • Recalculated the Pearson correlation coefficient and p-value using scipy.stats.pearsonr().
  • Conducted a hypothesis test for the correlation coefficient ($H_0: \rho = 0$) at $\alpha = 0.05$ and $\alpha = 0.01$.
  • Compared the p-value to significance levels to determine if the correlation is statistically significant.
  • Presented results and conclusions in a formatted text output.

Method used in Question 15

  • Tested the significance of the regression coefficient ($b$) for the regression of Marks on Study Hours.
  • Calculated the t-statistic using the correlation coefficient: $t = r \cdot \sqrt{\frac{n - 2}{1 - r^2}}$.
  • Computed the p-value for a two-tailed test using scipy.stats.t.cdf() with $df = n - 2$.
  • Compared the p-value to $\alpha = 0.05$ and $\alpha = 0.01$ to determine significance.
  • Presented results and conclusions in a formatted text output.
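
In simple linear regression, the t-statistic computed from $r$ is the same as the slope divided by its standard error, which is why the correlation-based formula can be used to test $b$. The sketch below checks that identity on hypothetical data (the `hours` and `marks` values are made up for the example). It stops at the t-statistic; the p-value would then come from the t distribution with $n - 2$ degrees of freedom, as described above.

```python
import math

# Hypothetical data for illustration only.
hours = [4, 6, 8, 10, 12, 14, 16]
marks = [65, 60, 74, 70, 83, 79, 90]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(marks) / n

# Sums of squares and cross-products around the means.
sxx = sum((x - mean_x) ** 2 for x in hours)
syy = sum((y - mean_y) ** 2 for y in marks)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, marks))

# Route 1: t-statistic from the correlation coefficient.
r = sxy / math.sqrt(sxx * syy)
t_from_r = r * math.sqrt((n - 2) / (1 - r ** 2))

# Route 2: slope divided by its standard error.
b = sxy / sxx
a = mean_y - b * mean_x
residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(hours, marks))
se_b = math.sqrt(residual_ss / (n - 2) / sxx)
t_from_slope = b / se_b

print(f"t from r: {t_from_r:.4f}, t from slope: {t_from_slope:.4f}")
```

The two routes agree up to floating-point error, so testing $H_0: \rho = 0$ and testing that the slope is zero lead to the same conclusion.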

Outputs

  • View the outputs in app.ipynb.

Conclusion

This code demonstrates a comprehensive statistical analysis workflow, from sampling theory to hypothesis testing and regression analysis. The analysis reveals important insights about the relationship between study hours and academic performance, supported by appropriate statistical tests and visualizations.

License

This project is licensed under the MIT License.

© 2025 Thanish Chinnappa K.C., Likhith, Sahil Vinod Patil, Sundareshwar S, Samith, Souharda Mandal.
