In this project we will compute several descriptive statistics without relying on numpy. The idea is to understand how each statistic is computed rather than calling a ready-made function; numpy is used only to generate random data and to check the results.

# Compute Mean and Median

Remember that the mean is calculated as $\bar{x} = n^{-1} \sum_{i=1}^{n} x_i$

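The median is the middle value of the sorted data: $x_{(i)}$ with $i = \frac{n+1}{2}$ when $n$ is odd, and the average of the two middle values, $\frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$, when $n$ is even.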
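As a hand-checkable illustration of the two definitions, here is a minimal sketch in plain Python. The function names `mean` and `median` and the small input lists are chosen for this example only; they are not part of the notebook cells that follow.

```python
def mean(values):
    # Arithmetic mean: sum of the values divided by how many there are
    return sum(values) / len(values)

def median(values):
    # sorted() returns a new list, so the input is left untouched
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: a single middle element
        return ordered[mid]
    # Even n: average of the two middle elements
    return (ordered[mid - 1] + ordered[mid]) / 2

print(mean([1, 2, 3, 4]))    # 2.5
print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```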

In [1]:
import numpy as np
import matplotlib.pyplot as plt

In [2]:
N = 12
data = list(np.random.randn(N))
data

[-0.5905593508668208,
 0.635413325891464,
 -0.5064764507105031,
 1.639045095281445,
 0.5388819708011949,
 1.2504160797628903,
 -0.2291743494087044,
 -0.20033093449296202,
 -1.1068587897391549,
 -0.2728044762141864,
 0.4120723660844522,
 1.6413966202754162]

In [3]:
# mean
avg = sum(data)/len(data)
print(avg)
print()
print(np.mean(data))

0.26758509222204424

0.26758509222204424


In [4]:
# Median
# Sort the data
data.sort()

# Find the data point in the middle: floor division
if N%2 == 1:
    mymedian = data[N//2]
else:
    mymedian = (data[N//2-1] + data[N//2])/2
print(mymedian)
print()
print(np.median(data))

0.10587071579574509

0.10587071579574509


## Frequencies Table

Generate 20 random integers between 1 and 5 using numpy

Create two lists: Unique values and their counts

Convert the lists into a dictionary

In [5]:
x = list(np.random.randint(1,6,20))
x

[5, 1, 5, 4, 5, 5, 3, 5, 3, 4, 5, 2, 4, 4, 3, 4, 2, 1, 4, 4]

In [6]:
# Use enumerate
l = [3,6,-4, "asfda"]

for i,j in enumerate(l):
    print(i,j)

0 3
1 6
2 -4
3 asfda


In [7]:
# Unique values using set
uniquevals = set(x)
uniquevals

{1, 2, 3, 4, 5}

In [8]:
# Count how many times each unique value appears in the data
valcounts = [0]*len(uniquevals)
for i in x:
    for j,u in enumerate(uniquevals):
        if i == u:
            valcounts[j] += 1
print(valcounts)

[2, 2, 3, 7, 6]


In [9]:
# Use list comprehension
valcounts = [0]*len(uniquevals)
for i in x:
    idx = [j for j,u in enumerate(uniquevals) if i == u]
    valcounts[idx[0]] += 1

print(valcounts)

[2, 2, 3, 7, 6]


In [10]:
# With Numpy
print(np.array(np.unique(x, return_counts=True)).T)
print(dict(np.array(np.unique(x, return_counts=True)).T))

[[1 2]
 [2 2]
 [3 3]
 [4 7]
 [5 6]]
{1: 2, 2: 2, 3: 3, 4: 7, 5: 6}


In [11]:
# Convert to dictionary
uniquevals = list(uniquevals)
table ={}
for i in range(len(uniquevals)):
    table[uniquevals[i]] = valcounts[i]
table

{1: 2, 2: 2, 3: 3, 4: 7, 5: 6}
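The nested-loop count above compares every data point against every unique value. A possible alternative, sketched here, builds the dictionary directly in a single pass with `dict.get`. The helper name `frequency_table` is made up for this sketch, and `sample` is a copy of the 20 integers printed earlier in this section.

```python
def frequency_table(values):
    # Build {value: count} in one pass over the data
    counts = {}
    for v in values:
        # get() returns 0 the first time a value is seen
        counts[v] = counts.get(v, 0) + 1
    return counts

# Copy of the random integers shown in the output above
sample = [5, 1, 5, 4, 5, 5, 3, 5, 3, 4, 5, 2, 4, 4, 3, 4, 2, 1, 4, 4]
table_sketch = frequency_table(sample)

# The two lists asked for at the start of the section
unique_sorted = sorted(table_sketch)
counts_sorted = [table_sketch[u] for u in unique_sorted]
print(unique_sorted)  # [1, 2, 3, 4, 5]
print(counts_sorted)  # [2, 2, 3, 7, 6]
```

Sorting the keys explicitly avoids relying on the iteration order of a `set`, which Python does not guarantee to be sorted.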

## Mode

We will use the dictionary created in the previous section to find the mode.

In [12]:
# One approach
maxcount = 0
for i in table.items():
    if i[1] > maxcount:
        maxcount = i[1]
        mode = i[0]
print("Mode %s appears %s times"%(mode, maxcount))

Mode 4 appears 7 times


In [13]:
# A second approach: collect all of the most frequent values
# First find the highest count
maxval = max(table.values())

# iterate over dictionary
mode = [k for k,v in table.items() if v==maxval]
print(maxval)
print(mode[0])

7
4


In [14]:
from scipy import stats
stats.mode(x)

ModeResult(mode=array([4]), count=array([7]))
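When two or more values share the highest count, the first approach keeps only one of them, while the list-based approach keeps them all. The sketch below wraps the list-based idea in a small helper and tries it on a tied table. The name `modes` and the second dictionary are hypothetical, introduced only for illustration.

```python
def modes(table):
    """Return (most_frequent_values, count) for a {value: count} dict."""
    top = max(table.values())
    # Keep every key whose count equals the maximum, in sorted order
    return sorted(k for k, v in table.items() if v == top), top

print(modes({1: 2, 2: 2, 3: 3, 4: 7, 5: 6}))  # ([4], 7)
print(modes({1: 3, 2: 3, 3: 1}))              # ([1, 2], 3)
```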

## Standard Deviation

Standard Deviation: $\sigma = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

In [15]:
# Generate Random Poisson Numbers
n = 100
x = list(np.random.poisson(2,n))
meanval = sum(x)/len(x)
meanval

1.93

In [16]:
summation = sum([ (i - meanval)**2 for i in x])
summation

194.51000000000022

In [17]:
# Unbiased STD
sdv = (summation/(n-1))**(1/2)
sdv

1.4016945012189628

In [18]:
# numpy's default (ddof=0) divides by n, which gives the biased std
sdv = (summation/(n))**(1/2)
sdv

1.3946684193742978

In [19]:
np.std(x)

1.3946684193742969

In [20]:
np.std(x, ddof = 1)

1.4016945012189619
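The difference between dividing by $n$ and by $n-1$ can also be checked without numpy. The sketch below uses a small made-up list whose sums are easy to verify by hand, and compares the manual results with `statistics.pstdev` (population, divides by $n$) and `statistics.stdev` (sample, divides by $n-1$) from the Python standard library.

```python
import math
import statistics

# Hand-checkable example values (not from the notebook's random data)
values = [2, 4, 4, 4, 5, 5, 7, 9]
count = len(values)

mean_val = sum(values) / count                        # 40 / 8 = 5.0
sum_sq = sum((v - mean_val) ** 2 for v in values)     # 9+1+1+1+0+0+4+16 = 32.0

pop_std = math.sqrt(sum_sq / count)                   # divides by n
sample_std = math.sqrt(sum_sq / (count - 1))          # divides by n - 1

print(pop_std, statistics.pstdev(values))
print(sample_std, statistics.stdev(values))
```

The sample version is always the larger of the two, and the gap shrinks as $n$ grows, which matches the small difference seen above for $n = 100$.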

## Bonus: Create a csv Report File

Create functions for each statistic

Write the values to a csv file

Download the file to computer

In [21]:
def mymean(data):
    return sum(data)/len(data)
    
def mymedian(data):
    # sorted() returns a new list, so the caller's data is not reordered
    data = sorted(data)
    n = len(data)
    if n%2 ==1:
        m = data[n//2]
    else:
        m = (data[n//2-1] + data[n//2])/2
    return m
    
def mymode(data):
    # Convert to dictionary
    uniquevals = list(set(data))
    valcounts = [0]*len(uniquevals)
    
    for i in data:
        idx = [j for j,u in enumerate(uniquevals) if i == u]
        valcounts[idx[0]] += 1
    
    table ={}
    for i in range(len(uniquevals)):
        table[uniquevals[i]] = valcounts[i]
    
    maxval = max(list(table.values()))

    # iterate over dictionary
    mode = [k for k,v in table.items() if v==maxval]
    
    return mode
    
    
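def mystd(data):
    # Use the length of the argument, not a global n left over from an earlier cell
    n = len(data)
    meanval = sum(data)/n
    
    summation = sum([ (i - meanval)**2 for i in data])
    
    # Unbiased (sample) standard deviation: divide by n-1
    sdv = (summation/(n-1))**(1/2)
    
    return sdv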

In [22]:
N = 40
data = np.random.randint(10, 41, size = N)
data

array([19, 20, 35, 15, 11, 15, 13, 37, 20, 39, 37, 32, 10, 16, 12, 33, 35,
       35, 36, 12, 38, 22, 38, 26, 25, 25, 10, 34, 14, 11, 38, 39, 25, 18,
       10, 21, 39, 35, 12, 39])

In [23]:
# The with block closes the file automatically, even if a write fails
with open("stats.csv", "w") as fid:
    fid.write("mean:" + str(mymean(data)) + '\n')
    fid.write("median:" + str(mymedian(data)) + '\n')
    fid.write("mode:" + str(mymode(data)) + '\n')
    fid.write("std:" + str(mystd(data)) + '\n')
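The cell above writes `name:value` lines, which is not strictly comma-separated. One possible variant, sketched here with the standard library's `csv` module, writes a header row and one `statistic,value` row per entry. The helper `write_report`, the file name `stats_sketch.csv`, and the numbers in `report` are placeholders for illustration, not results from the data above.

```python
import csv

def write_report(path, stats):
    # newline="" is what the csv module documentation recommends when opening files
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["statistic", "value"])
        for name, value in stats.items():
            # Non-string values are converted with str(); fields that
            # contain commas (such as a list of modes) are quoted automatically
            writer.writerow([name, value])

# Placeholder values, only to exercise the function
report = {"mean": 2.5, "median": 2.5, "mode": [1, 2], "std": 1.29}
write_report("stats_sketch.csv", report)

# Read the file back to confirm the layout
with open("stats_sketch.csv", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

In the notebook, the dictionary could instead be filled with the outputs of `mymean`, `mymedian`, `mymode`, and `mystd`.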