# Introduction to Probability and Statistics
In this notebook, we will play around with some of the concepts we discussed earlier. Many concepts from probability and statistics are well represented in the major Python data-processing libraries, such as `numpy` and `pandas`.


In [None]:
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt

## Random Variables and Distributions
Let's start by drawing a sample of 30 values from a uniform distribution over the integers from 0 to 9. We will also compute the mean and variance.


In [None]:
# random.randint includes both endpoints, so (0, 9) yields integers from 0 to 9
sample = [random.randint(0, 9) for _ in range(30)]
print(f"Sample: {sample}")
print(f"Mean = {np.mean(sample)}")
print(f"Variance = {np.var(sample)}")
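To make the two quantities concrete, here is a minimal sketch that computes them by hand on made-up numbers (not the random sample above). It also shows the difference between population variance (divide by n) and sample variance (divide by n - 1). `np.var` uses the population form by default, while the pandas `.var()` method uses the sample form by default.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up numbers for illustration

n = len(values)
avg = sum(values) / n

# Population variance: average squared deviation from the mean (divide by n)
pop_var = sum((x - avg) ** 2 for x in values) / n
# Sample variance: divide by n - 1 instead (Bessel's correction)
sample_var = sum((x - avg) ** 2 for x in values) / (n - 1)

print(f"Mean = {avg}")
print(f"Population variance = {pop_var}")
print(f"Sample variance = {sample_var}")

# The standard library gives the same two numbers
print(statistics.pvariance(values), statistics.variance(values))
```

For large samples the two variants are nearly identical. The difference matters mostly for small samples.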

To visually estimate how many different values there are in the sample, we can plot the **histogram**:


In [None]:
plt.hist(sample)
plt.show()
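The same information can also be read off as numbers. A minimal sketch, using made-up values rather than the random sample above, with `collections.Counter` from the standard library:

```python
from collections import Counter

# Made-up sample for illustration (not the random sample drawn above)
values = [3, 7, 3, 0, 9, 7, 3, 5]

counts = Counter(values)   # maps each value to how often it occurs
distinct = len(counts)     # number of different values in the sample

print(f"Distinct values: {distinct}")
print(f"Most common (value, count): {counts.most_common(1)}")
```

A histogram is the visual counterpart of this table: each bar's height is the count of values that fall into that bin.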

## Analyzing Real Data

Mean and variance are very important when analyzing real-world data. Let's load the data about baseball players from [SOCR MLB Height/Weight Data](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights)


In [None]:
df = pd.read_csv("../../data/SOCR_MLB.tsv",sep='\t', header=None, names=['Name','Team','Role','Weight','Height','Age'])
df


> We are using a package called [**Pandas**](https://pandas.pydata.org/) here for data analysis. We will talk more about Pandas and working with data in Python later in this course.

Let's compute the average values for age, height and weight:


In [None]:
df[['Age','Height','Weight']].mean()

Now let's focus on height, and compute the standard deviation and variance:


In [None]:
print(list(df['Height'])[:20])

In [None]:
mean = df['Height'].mean()
var = df['Height'].var()
std = df['Height'].std()
print(f"Mean = {mean}\nVariance = {var}\nStandard Deviation = {std}")

In addition to the mean, it makes sense to look at the median value and the quartiles. They can be visualized using a **box plot**:


In [None]:
plt.figure(figsize=(10,2))
plt.boxplot(df['Height'].ffill(), vert=False, showmeans=True)
plt.grid(color='gray', linestyle='dotted')
plt.tight_layout()
plt.show()
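The numbers behind a box plot can be sketched in a few lines. The example below uses made-up heights (not the dataset) and the standard library's `statistics.quantiles`. The 1.5 × IQR fences match matplotlib's default whisker setting (`whis=1.5`).

```python
import statistics

toy_heights = [66, 70, 71, 72, 73, 74, 75, 76, 84]  # made-up values

# Quartiles: the three cut points that split the sorted data into four parts
q1, median, q3 = statistics.quantiles(toy_heights, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the width of the box

# Points beyond 1.5 * IQR from the box are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [h for h in toy_heights if h < lower_fence or h > upper_fence]

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"Outliers: {outliers}")
```

Here the box spans 71 to 75, and only the value 84 falls outside the upper fence of 81.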

We can also make box plots of subsets of our dataset, for example, grouped by player role.


In [None]:
df.boxplot(column='Height', by='Role', figsize=(10,8))
plt.xticks(rotation='vertical')
plt.tight_layout()
plt.show()

> **Note**: This diagram suggests that, on average, first basemen are taller than second basemen. Later we will learn how to test this hypothesis more formally, and how to show that the difference in our data is statistically significant.

Age, height and weight are all continuous random variables. What do you think their distribution is? A good way to find out is to plot the histogram of values:


In [None]:
df['Weight'].hist(bins=15, figsize=(10,6))
plt.suptitle('Weight distribution of MLB Players')
plt.xlabel('Weight')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

## Normal Distribution

Let's create an artificial sample of weights that follows a normal distribution with the same mean and variance as our real data:


In [None]:
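# Use the Weight column's own mean and standard deviation (`mean` and `std` above describe Height)
generated = np.random.normal(df['Weight'].mean(), df['Weight'].std(), 1000)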
generated[:20]

In [None]:
plt.figure(figsize=(10,6))
plt.hist(generated, bins=15)
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10,6))
plt.hist(np.random.normal(0,1,50000), bins=300)
plt.tight_layout()
plt.show()

Since many real-life quantities are approximately normally distributed, we should not use a uniform random number generator to generate sample data. Here is what happens if we try to generate weights with a uniform distribution (generated by `np.random.rand`):


In [None]:
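# Uniform values on [mean - std, mean + std], using the Weight column's statistics
w_mean, w_std = df['Weight'].mean(), df['Weight'].std()
wrong_sample = np.random.rand(1000)*2*w_std+w_mean-w_std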
plt.figure(figsize=(10,6))
plt.hist(wrong_sample)
plt.tight_layout()
plt.show()
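Besides having the wrong shape, the uniform sample built this way also has the wrong spread. A uniform distribution on an interval of width 2·std has a standard deviation of std/√3, not std. A minimal sketch with made-up numbers (the mean of 200 and std of 20 are assumptions for illustration, not values from the dataset):

```python
import math
import random
import statistics

rng = random.Random(0)                   # fixed seed so the sketch is reproducible
target_mean, target_std = 200.0, 20.0    # made-up values for illustration

# Same construction as in the cell above: uniform on [mean - std, mean + std]
uniform_sample = [rng.random() * 2 * target_std + target_mean - target_std
                  for _ in range(20000)]

print(f"min = {min(uniform_sample):.1f}, max = {max(uniform_sample):.1f}")
print(f"std of the sample = {statistics.stdev(uniform_sample):.2f}")
print(f"std / sqrt(3)     = {target_std / math.sqrt(3):.2f}")
```

No value can fall outside the interval, so the tails that a normal distribution would have are missing entirely.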

## Confidence Intervals

Let's now calculate confidence intervals for the weights and heights of baseball players. We will use the code [from this stackoverflow discussion](https://stackoverflow.com/questions/15033511/compute-a-confidence-interval-from-sample-data):


In [None]:
import scipy.stats

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return m, h

for p in [0.85, 0.9, 0.95]:
    # ffill() forward-fills missing values; fillna(method='pad') is deprecated in recent pandas
    m, h = mean_confidence_interval(df['Weight'].ffill(),p)
    print(f"p={p:.2f}, mean = {m:.2f} ± {h:.2f}")
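The function above returns the mean and a half-width `h` equal to the standard error times a quantile of Student's t distribution. To show the structure of that formula without SciPy, here is a rough sketch that swaps in a normal quantile from the standard library. This is an approximation, not the method used above: it is close to the t-based interval for large samples and somewhat too narrow for small ones. The function name and the weights are made up for illustration.

```python
import math
import statistics
from statistics import NormalDist

def mean_ci_normal_approx(data, confidence=0.95):
    """Return (mean, half_width) using a normal quantile instead of Student's t."""
    n = len(data)
    m = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(n)       # standard error of the mean
    z = NormalDist().inv_cdf((1 + confidence) / 2)   # about 1.96 for 95%
    return m, z * se

toy_weights = [180, 215, 210, 188, 176, 209, 200, 231, 180, 188]  # made-up values
for p in [0.85, 0.9, 0.95]:
    m, h = mean_ci_normal_approx(toy_weights, p)
    print(f"p={p:.2f}, mean = {m:.2f} ± {h:.2f}")
```

In both versions, asking for higher confidence widens the interval, and a larger sample narrows it through the standard error.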

## Hypothesis Testing

Let's explore the different roles in our baseball players dataset:


In [None]:
df.groupby('Role').agg({ 'Weight' : 'mean', 'Height' : 'mean', 'Age' : 'count'}).rename(columns={ 'Age' : 'Count'})

Let's test the hypothesis that first basemen are taller than second basemen. The simplest way to do this is to compare the confidence intervals:


In [None]:
for p in [0.85,0.9,0.95]:
    m1, h1 = mean_confidence_interval(df.loc[df['Role']=='First_Baseman',['Height']],p)
    m2, h2 = mean_confidence_interval(df.loc[df['Role']=='Second_Baseman',['Height']],p)
    print(f'Conf={p:.2f}, 1st basemen height: {m1-h1[0]:.2f}..{m1+h1[0]:.2f}, 2nd basemen height: {m2-h2[0]:.2f}..{m2+h2[0]:.2f}')

We can see that the intervals do not overlap.

A more statistically sound way to test the hypothesis is to use a **Student's t-test**:


In [None]:
from scipy.stats import ttest_ind

tval, pval = ttest_ind(df.loc[df['Role']=='First_Baseman',['Height']], df.loc[df['Role']=='Second_Baseman',['Height']],equal_var=False)
print(f"T-value = {tval[0]:.2f}\nP-value: {pval[0]}")

The two values returned by the `ttest_ind` function are:
* The p-value is the probability of seeing a difference at least as large as the one in our data if the two groups actually had the same mean. In our case it is very low, which is strong evidence that first basemen are taller.
* The t-value is the intermediate value of the normalized mean difference used in the t-test. It is compared against a threshold value that depends on the chosen confidence level.
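The "normalized mean difference" can be written out directly. With `equal_var=False`, as in the call above, SciPy performs Welch's t-test, whose statistic is the difference of the means divided by the combined standard error. A minimal sketch on made-up heights (the `welch_t` helper is hypothetical and covers only the t-value, not the p-value):

```python
import math
import statistics

def welch_t(a, b):
    """t statistic for two samples without assuming equal variances."""
    se2 = statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)

first = [75, 76, 74, 77, 75]    # made-up heights, not from the dataset
second = [71, 72, 70, 73, 71]

print(f"t = {welch_t(first, second):.2f}")
```

A large absolute t-value means the gap between the means is big compared to the noise within each group, which corresponds to a small p-value.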


## Simulating a Normal Distribution with the Central Limit Theorem

The pseudo-random generator in Python is designed to give us a uniform distribution. If we want to create a generator for a normal distribution, we can use the central limit theorem. To get an approximately normally distributed value, we just compute the mean of a uniformly generated sample.


In [None]:
def normal_random(sample_size=100):
    sample = [random.uniform(0,1) for _ in range(sample_size) ]
    return sum(sample)/sample_size

sample = [normal_random() for _ in range(100)]
plt.figure(figsize=(10,6))
plt.hist(sample)
plt.tight_layout()
plt.show()
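The values produced this way cluster around 0.5 with a small spread, because the mean of `n` uniform draws on [0, 1] has expectation 0.5 and variance 1/(12n). To get an approximately standard normal value (mean 0, standard deviation 1), one option is to shift and rescale. A minimal sketch; the function name is an assumption for illustration:

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def standard_normal_clt(sample_size=100):
    """Approximate a standard normal value from the mean of uniform draws."""
    avg = sum(rng.random() for _ in range(sample_size)) / sample_size
    # Mean of uniform(0, 1) is 0.5; variance of the average is 1 / (12 * n)
    return (avg - 0.5) / math.sqrt(1 / (12 * sample_size))

draws = [standard_normal_clt() for _ in range(5000)]
print(f"mean = {statistics.mean(draws):.3f}, std = {statistics.stdev(draws):.3f}")
```

In practice `np.random.normal` or `random.gauss` is the better choice. This construction is mainly useful for seeing the theorem at work.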

## Correlation and Evil Baseball Corp

Correlation allows us to find relations between data sequences. In our toy example, let's pretend there is an evil baseball corporation that pays its players according to their height: the taller the player, the more money they get. Suppose there is a base salary of $1000, plus a bonus from $0 to $100, depending on height. We will take the real players from MLB and compute their imaginary salaries:


In [None]:
heights = df['Height'].ffill()
# Min-max scaling maps height to [0, 1], so the bonus runs from $0 to $100
salaries = 1000+(heights-heights.min())/(heights.max()-heights.min())*100
print(list(zip(heights, salaries))[:10])

Let's now compute the covariance and correlation of those sequences. `np.cov` gives us a so-called **covariance matrix**, which is an extension of covariance to multiple variables. Each element $M_{ij}$ of the covariance matrix $M$ is the covariance between input variables $X_i$ and $X_j$, and the diagonal values $M_{ii}$ are the variance of $X_{i}$. Similarly, `np.corrcoef` gives us the **correlation matrix**.


In [None]:
print(f"Covariance matrix:\n{np.cov(heights, salaries)}")
print(f"Covariance = {np.cov(heights, salaries)[0,1]}")
print(f"Correlation = {np.corrcoef(heights, salaries)[0,1]}")
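The two quantities are closely related: correlation is covariance divided by the product of the two standard deviations, which removes the units and bounds the result between -1 and 1. A minimal sketch on made-up heights, assuming the bonus is scaled to run from $0 to $100 as described above:

```python
import math

toy_heights = [70, 72, 74, 76, 78]   # made-up values, not from the dataset
lo, hi = min(toy_heights), max(toy_heights)
toy_salaries = [1000 + (h - lo) / (hi - lo) * 100 for h in toy_heights]

n = len(toy_heights)
mean_h = sum(toy_heights) / n
mean_s = sum(toy_salaries) / n

# Sample covariance and variances (dividing by n - 1, as np.cov does by default)
cov = sum((h - mean_h) * (s - mean_s)
          for h, s in zip(toy_heights, toy_salaries)) / (n - 1)
var_h = sum((h - mean_h) ** 2 for h in toy_heights) / (n - 1)
var_s = sum((s - mean_s) ** 2 for s in toy_salaries) / (n - 1)

corr = cov / math.sqrt(var_h * var_s)
print(f"Covariance = {cov}, Correlation = {corr}")
```

The covariance depends on the units (inches times dollars here), while the correlation does not. That is why correlation is easier to interpret.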

A correlation equal to 1 means that there is an exact, increasing **linear relation** between the two variables. We can see the linear relation clearly by plotting one value against the other:


In [None]:
plt.figure(figsize=(10,6))
plt.scatter(heights,salaries)
plt.tight_layout()
plt.show()

Let's see what happens if the relation is not linear. Suppose that our corporation decided to hide the obvious linear dependency between heights and salaries, and introduced some non-linearity into the formula, such as `sin`:


In [None]:
salaries = 1000+np.sin((heights-heights.min())/(heights.max()-heights.min()))*100
print(f"Correlation = {np.corrcoef(heights, salaries)[0,1]}")

In this case, the correlation is slightly lower, but it is still quite high. Now, to make the relation even less obvious, we can add some extra randomness by adding a random variable to the salary. Let's see what happens:


In [None]:
salaries = 1000+np.sin((heights-heights.min())/(heights.max()-heights.min()))*100+np.random.random(size=len(heights))*20-10
print(f"Correlation = {np.corrcoef(heights, salaries)[0,1]}")

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(heights, salaries)
plt.tight_layout()
plt.show()

> Can you guess why the dots line up into vertical lines like this?

We have observed the correlation between an artificially engineered quantity like salary and the observed variable *height*. Let's also check whether two observed variables, such as height and weight, correlate:


In [None]:
np.corrcoef(df['Height'].ffill(),df['Weight'])

Unfortunately, we did not get any results, only some strange `nan` values. This happens because some of the values in our series are undefined, represented as `nan`, which makes the result of the operation undefined as well. Looking at the matrix, we can see that `Weight` is the problematic column, because the self-correlation of the `Height` values has been computed.

> This example shows the importance of **data preparation** and **cleaning**. Without proper data we cannot compute anything.
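How a single missing value spoils a whole computation, and how forward filling repairs it, can be sketched without pandas. The `forward_fill` helper below is hypothetical and only imitates the idea behind the pandas `ffill` method. The values are made up.

```python
import math

def forward_fill(values):
    """Replace each NaN with the most recent non-NaN value (hypothetical helper)."""
    filled, last = [], math.nan
    for v in values:
        if not math.isnan(v):
            last = v
        filled.append(last)
    return filled

toy_weights = [180.0, math.nan, 210.0, math.nan, math.nan]  # made-up values

print(sum(toy_weights))            # nan: one missing value poisons the sum
print(forward_fill(toy_weights))   # [180.0, 180.0, 210.0, 210.0, 210.0]
```

Forward filling is only one possible strategy. Dropping the incomplete rows or filling with the column mean are common alternatives, and the right choice depends on the data.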

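Let's use the `ffill` method (forward fill) to fill the missing values, and compute the correlation:


In [None]:
# Weight is the column with missing values; forward-filling both columns is safe either way
np.corrcoef(df['Height'].ffill(), df['Weight'].ffill())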

Indeed, there is a correlation, but it is not as strong as in our artificial example. If we look at the scatter plot of one value against the other, the relation is much less obvious:


In [None]:
plt.figure(figsize=(10,6))
plt.scatter(df['Weight'],df['Height'])
plt.xlabel('Weight')
plt.ylabel('Height')
plt.tight_layout()
plt.show()

## Conclusion

In this notebook we have learned how to perform basic operations on data to compute statistical functions. We now know how to use sound mathematical and statistical tools to test some hypotheses, and how to compute confidence intervals for a variable given a data sample.


