## Some Basic Statistics

Unlike certain programming environments (such as __R__), Python was not designed specifically for statistics and data science. Consequently, many of the methods that are available in data manipulation and statistical analysis software (SAS, Minitab, GenStat, and so on) are not directly available in Python. However, there are a number of high quality libraries that provide them. This section introduces some of these.

## Regression with scikit-learn

The package [scikit-learn](https://scikit-learn.org/stable/) is a widely used Python library for machine learning and is built on top of NumPy, SciPy, and a few other packages. It provides the means for preprocessing data, reducing dimensionality, and implementing regression, classification, clustering, and more. Like NumPy, scikit-learn is open source.

### Installing scikit-learn

Warning: this could take some time, depending on what is already installed on your system.

In [1]:
! python -m pip install scikit-learn

Collecting scikit-learn
  Using cached scikit_learn-1.0.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.7 MB)
Collecting numpy>=1.14.6
  Downloading numpy-1.21.3-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-3.0.0-py3-none-any.whl (14 kB)
Collecting scipy>=1.1.0
  Using cached scipy-1.7.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (28.5 MB)
Collecting joblib>=0.11
  Using cached joblib-1.1.0-py2.py3-none-any.whl (306 kB)
Installing collected packages: numpy, threadpoolctl, scipy, joblib, scikit-learn
Successfully installed joblib-1.1.0 numpy-1.21.3 scikit-learn-1.0.1 scipy-1.7.1 threadpoolctl-3.0.0
You should consider upgrading via the '/home/grosedj/python-envs/M550/env/bin/python -m pip install --upgrade pip' command.
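
Once the installation has finished, a quick sanity check is to import the packages and print their versions. This is only a sketch; the version numbers you see depend on your own environment and need not match the ones in the output above.

```python
# Confirm that scikit-learn and its NumPy dependency can be imported.
import numpy
import sklearn

print("numpy", numpy.__version__)
print("scikit-learn", sklearn.__version__)
```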


### Importing scikit-learn for linear regression

In [2]:
import numpy as np
from sklearn.linear_model import LinearRegression

####  create data

In [3]:
x = np.array([[5, 15, 25, 35, 45, 55]]).T # note - the .T is a shorthand way of transposing the array
y = np.array([5, 20, 14, 32, 22, 38])
print(x)
print(y)

[[ 5]
 [15]
 [25]
 [35]
 [45]
 [55]]
[ 5 20 14 32 22 38]
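
scikit-learn expects the predictors as a two dimensional array with one row per observation and one column per variable, which is why x is transposed in the cell above. As a minimal sketch using the same six values, the same shape can also be produced with `reshape`; the names `values`, `x_transposed` and `x_reshaped` are only illustrative.

```python
import numpy as np

values = np.array([5, 15, 25, 35, 45, 55])   # one dimensional, shape (6,)

x_transposed = np.array([values]).T           # shape (6, 1), as in the cell above
x_reshaped = values.reshape(-1, 1)            # -1 lets NumPy infer the number of rows

print(values.shape, x_transposed.shape, x_reshaped.shape)
print(np.array_equal(x_transposed, x_reshaped))
```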


#### create a model

In [4]:
model = LinearRegression()

#### fit model

In [5]:
result = model.fit(x,y)

#### examine results

In [6]:
print(result.intercept_)
print(result.coef_)

5.633333333333329
[0.54]


### <u>Exercise 1</u>

What is the type of result.coef_? Why do you think it is this type?

In [7]:
type(result.coef_)

numpy.ndarray

### Processing the results

#### predict the data from the model

In [8]:
y_hat = model.predict(x)
print(y_hat)

[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333 35.33333333]
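
To see what predict is doing, the fitted values can be rebuilt by hand from the intercept and the coefficient. The sketch below refits the same data so that it runs on its own; `y_manual` is an illustrative name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape(-1, 1)
y = np.array([5, 20, 14, 32, 22, 38])

model = LinearRegression().fit(x, y)

# For a linear model the prediction is the intercept plus
# the predictors multiplied by their coefficients.
y_manual = model.intercept_ + x @ model.coef_

print(y_manual)
print(np.allclose(y_manual, model.predict(x)))
```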


#### Import and use some model metrics from scikit-learn

In [9]:
from sklearn import metrics

print(metrics.mean_absolute_error(y, y_hat))
print(metrics.mean_squared_error(y, y_hat))

5.466666666666666
33.75555555555555
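
Both metrics are simple averages over the residuals, so they can be checked with plain NumPy. This sketch copies the fitted values printed earlier; the root mean squared error is included as an extra because it is in the same units as y.

```python
import numpy as np
from sklearn import metrics

y = np.array([5, 20, 14, 32, 22, 38])
y_hat = np.array([8.33333333, 13.73333333, 19.13333333,
                  24.53333333, 29.93333333, 35.33333333])

residuals = y - y_hat
mae = np.mean(np.abs(residuals))   # mean absolute error
mse = np.mean(residuals ** 2)      # mean squared error
rmse = np.sqrt(mse)                # root mean squared error, in the units of y

print(mae, mse, rmse)
print(np.isclose(mae, metrics.mean_absolute_error(y, y_hat)))
print(np.isclose(mse, metrics.mean_squared_error(y, y_hat)))
```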


## Multiple regression

When the data become more complex, it is often more convenient to use a pandas data frame than a plain NumPy array.

### <u>Example 2</u>

#### Install pandas

In [10]:
! python -m pip install pandas

Collecting pandas
  Using cached pandas-1.3.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)
Collecting pytz>=2017.3
  Using cached pytz-2021.3-py2.py3-none-any.whl (503 kB)
Installing collected packages: pytz, pandas
Successfully installed pandas-1.3.4 pytz-2021.3
You should consider upgrading via the '/home/grosedj/python-envs/M550/env/bin/python -m pip install --upgrade pip' command.


#### Import pandas

In [11]:
import pandas as pd

#### download the data

<a href="./data/timber.txt" download>timber.txt</a>

#### load the data into a pandas data frame

In [12]:
timber = pd.read_csv('./data/timber.txt', sep=r'\s+') # columns are separated by runs of whitespace
print(timber)

    volume   girth  height
0   0.7458   66.23    21.0
1   0.7458   68.62    19.5
2   0.7386   70.22    18.9
3   1.1875   83.79    21.6
4   1.3613   85.38    24.3
5   1.4265   86.18    24.9
6   1.1296   87.78    19.8
7   1.3179   87.78    22.5
8   1.6365   88.57    24.0
9   1.4410   89.37    22.5
10  1.7524   90.17    23.7
11  1.5206   90.97    22.8
12  1.5496   90.97    22.8
13  1.5424   93.36    20.7
14  1.3831   95.76    22.5
15  1.6075  102.94    22.2
16  2.4475  102.94    25.5
17  1.9841  106.13    25.8
18  1.8610  109.32    21.3
19  1.8030  110.12    19.2
20  2.4982  111.71    23.4
21  2.2954  113.31    24.0
22  2.6285  115.70    22.2
23  2.7734  127.67    21.6
24  3.0847  130.07    23.1
25  4.0116  138.05    24.3
26  4.0333  139.64    24.6
27  4.2216  142.84    24.0
28  3.7292  143.63    24.0
29  3.6930  143.63    24.0
30  5.5757  164.38    26.1
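
If the file is not to hand, the way whitespace separated text is parsed can still be tried out on an in-memory string. This is a sketch that assumes the file is laid out like the printed table (a header line followed by whitespace separated numbers); `sample_text` and `sample` are illustrative names and hold only the first three rows.

```python
import io
import pandas as pd

# The first three rows of the table above, laid out as whitespace separated text.
sample_text = """volume girth height
0.7458 66.23 21.0
0.7458 68.62 19.5
0.7386 70.22 18.9
"""

# sep=r"\s+" splits each line on any run of spaces or tabs.
sample = pd.read_csv(io.StringIO(sample_text), sep=r"\s+")

print(sample)
print(sample.dtypes)
```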


#### columns can be accessed for a pandas data frame using the name of the column

In [13]:
timber[["girth","height"]]

   girth  height
0  66.23    21.0
1  68.62    19.5
2  70.22    18.9
3  83.79    21.6
4  85.38    24.3
5  86.18    24.9
6  87.78    19.8
7  87.78    22.5
8  88.57    24.0
9  89.37    22.5
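
Whether a single name or a list of names is used makes a difference to what comes back. The sketch below builds a small frame from the first three rows of the timber data to show the types and shapes involved; the variable names are illustrative.

```python
import pandas as pd

# A small frame using the first three rows of the timber data.
sample = pd.DataFrame({
    "volume": [0.7458, 0.7458, 0.7386],
    "girth": [66.23, 68.62, 70.22],
    "height": [21.0, 19.5, 18.9],
})

one_column = sample["girth"]               # single name: a Series (one dimensional)
as_frame = sample[["girth"]]               # list with one name: a DataFrame (two dimensional)
two_columns = sample[["girth", "height"]]  # list of names: a DataFrame

print(type(one_column).__name__, one_column.shape)
print(type(as_frame).__name__, as_frame.shape)
print(type(two_columns).__name__, two_columns.shape)
```

The two dimensional form is the one scikit-learn expects for the predictors.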


In [14]:
results = model.fit(timber[["girth","height"]],timber[["volume"]])

In [15]:
print(results.intercept_)
print(results.coef_)

[-4.19899732]
[[0.04272511 0.08188343]]
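
Because the response was passed as a one column data frame, the intercept and coefficients come back wrapped in an extra dimension. The sketch below uses only the first six rows of the table above, so its fitted numbers will differ from the full fit; it is meant to show the shapes, and how passing a Series instead gives a one dimensional coefficient array.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# First six rows of the timber data shown above; NOT the full data set.
sample = pd.DataFrame({
    "volume": [0.7458, 0.7458, 0.7386, 1.1875, 1.3613, 1.4265],
    "girth": [66.23, 68.62, 70.22, 83.79, 85.38, 86.18],
    "height": [21.0, 19.5, 18.9, 21.6, 24.3, 24.9],
})
predictors = sample[["girth", "height"]]

fit_frame = LinearRegression().fit(predictors, sample[["volume"]])   # 2-D response
fit_series = LinearRegression().fit(predictors, sample["volume"])    # 1-D response

print(fit_frame.coef_.shape, fit_frame.intercept_.shape)   # (1, 2) (1,)
print(fit_series.coef_.shape)                               # (2,)

# Pair each coefficient with its column name for easier reading.
print(dict(zip(predictors.columns, fit_series.coef_)))
```

Both fits describe the same model; only the layout of the returned arrays changes.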


### <u>Exercise 2</u>

Download this <a href="./data/Pollute.txt" download>pollution data</a>, load it into a data frame, and regress Pollution against the other variables using scikit-learn.

## Hypothesis tests with scipy.stats

Another useful library for statistics and data science is [scipy](https://www.scipy.org/). It belongs to the wider SciPy ecosystem of scientific Python packages, which also includes NumPy, pandas, and matplotlib, a popular data visualisation package. The scipy package is quite large and is organised into multiple submodules. For hypothesis tests, it is the scipy.stats module which is of interest.

### <u>Example 3</u>

#### Shapiro-Wilk Test

Tests whether a data sample has a Gaussian distribution.

Assumptions

- Observations in each sample are independent and identically distributed (iid).

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.


In [17]:
from scipy.stats import shapiro
data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
stat, p = shapiro(data)
print((stat, p))
if p > 0.05:
    print('Probably Gaussian')
else:
    print('Probably not Gaussian')

(0.8951009511947632, 0.19340917468070984)
Probably Gaussian
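
The decision rule used above can be wrapped in a small helper so that the significance level is explicit. This is only a sketch: `looks_gaussian` and its `alpha` parameter are illustrative names and are not part of scipy.

```python
from scipy.stats import shapiro

def looks_gaussian(sample, alpha=0.05):
    """Return (statistic, p_value, verdict) for a Shapiro-Wilk test.

    looks_gaussian and alpha are illustrative names, not part of scipy.
    The verdict is True when H0 (Gaussian) is NOT rejected at level alpha.
    """
    stat, p = shapiro(sample)
    return stat, p, p > alpha

data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
stat, p, verdict = looks_gaussian(data)
print(f"W = {stat:.3f}, p = {p:.3f}, fail to reject H0: {verdict}")
```

A p-value above the threshold means only that the test found no evidence against H0; it does not prove that the sample is Gaussian.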


In [20]:
import scipy.stats

### <u>Exercise 3</u>

Import the scipy.stats module using the alias sps.

In [23]:
import scipy.stats as sps

### <u>Exercise 4</u>

Use the __help__ function to find out what other tests are available from scipy.stats.

In [24]:
help(sps.stats)

Help on module scipy.stats.stats in scipy.stats:

NAME
    scipy.stats.stats - A collection of basic statistical functions for Python.

DESCRIPTION
    References
    ----------
    .. [CRCProbStat2000] Zwillinger, D. and Kokoska, S. (2000). CRC Standard
       Probability and Statistics Tables and Formulae. Chapman & Hall: New
       York. 2000.

CLASSES
    
     |  or if all the inputs have length 1.
     |  
     |  Method resolution order:
     |      builtins.Exception
     |      builtins.BaseException
     |      builtins.object
     |  
     |  Data descriptors defined here:
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  
     |  __init__(self, /, *args, **kwargs)
     |      Initialize self.  See help(type(self)) for accurate signature.
     |  
     |  ----------------------------------------------------------------------
     |  
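
The help output can be very long. As a sketch of a more compact way to browse what is available, the docstring of the scipy.stats package itself (the same text that help(sps) displays) acts as an index of its functions; `overview` is an illustrative name.

```python
import scipy.stats as sps

# The docstring of the scipy.stats package is an index of what it provides.
overview = sps.__doc__
print(overview[:1500])   # print only the beginning; the full text is long

# Functions such as shapiro appear by name in that index.
print("shapiro" in overview)
```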
     