# Programming for Data Analysis Assignment

![pic](https://moriohcdn.b-cdn.net/ff3cc511fb.png)

# PROBLEM STATEMENT
For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose.

Specifically, in this project you should:
- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
- Investigate the types of variables involved, their likely distributions, and their relationships with each other.
- Synthesise/simulate a data set as closely matching their properties as possible.
- Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

Note that this project is about simulation – you must synthesise a data set. Some students may already have some real-world data sets in their own files. It is okay to base your synthesised data set on these should you wish (please reference it if you do), but the main task in this project is to create a synthesised data set. The next section gives an example project idea.

# INTRODUCTION
As a golf enthusiast, I picked the performance of professional golfers on the PGA TOUR and DP World Tour as my real-world phenomenon. After some research, I decided that the most interesting variable is each player's average score over the 2022/2023 season, so this is one of my variables (Average Score). The other variables are rank, strokes and rounds.

# INVESTIGATION OF REAL WORLD DATA PHENOMENON
## COLLECTION OF DATA POINTS

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import scipy
from scipy.stats import norm
import math
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# https://www.pgatour.com/stats/detail/120
df = pd.read_csv('pgatour_2022-2023_actualaveragescore.csv')
df

In [None]:
df = df.drop('MOVEMENT',axis=1)

In [None]:
df = df.drop('PLAYER_ID',axis=1)

In [None]:
df

In [None]:
df["TOUR"] = "PGA TOUR"
df

##  DISTRIBUTIONS
- Normal Distribution
- Uniform distribution

In [None]:
# Average Score appears to follow a Normal Distribution
# Plot Average Score data from pgatour_2022-2023_actualaveragescore.csv (loaded above)
AVG = df.AVG
plt.hist(AVG)
plt.show()

In [None]:
RANK = df.RANK
RANK

In [None]:
RANK = np.array(RANK)

In [None]:
# RANK is uniformly distributed
plt.hist(RANK)
plt.show()

#  SYNTHESIZE / SIMULATE A DATA SET

In [None]:
syn_rank = np.arange(1, 192, 1)  # ranks 1..191 inclusive (the stop value is exclusive)
syn_rank

In [None]:
x, counts = np.unique(syn_rank, return_counts=True)
x, counts

In [None]:
plt.hist(syn_rank)
plt.show()

In [None]:
low = 1
high = 191
size = 191

In [None]:
rng = np.random.default_rng()
rand_ints = rng.integers(low=low, high=high+1, size=size)
rand_ints
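
Drawing ranks with `rng.integers` samples with replacement, so some ranks repeat and others never appear, whereas a ranking (ties aside) uses each position once. The following is a minimal sketch of that difference, assuming 191 players as above; the seed and the variable names are illustrative only and not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, used only for reproducibility
n_players = 191                      # number of players, as in the cells above

# Discrete uniform draws with replacement: duplicate ranks are expected.
with_replacement = rng.integers(low=1, high=n_players + 1, size=n_players)

# A random permutation of 1..191: every rank appears exactly once.
shuffled_ranks = rng.permutation(np.arange(1, n_players + 1))

print(len(np.unique(with_replacement)))  # well below 191
print(len(np.unique(shuffled_ranks)))    # 191
```

Both approaches give a flat histogram over the range 1 to 191, but only the permutation behaves like a set of ranks.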

In [None]:
x, counts = np.unique(rand_ints, return_counts=True)
x, counts

In [None]:
# Create an empty plot.
fig, ax = plt.subplots(figsize=(12, 3))

# Plot a bar chart.
ax.bar(x, counts);

In [None]:
plt.hist(rand_ints)
plt.show()

In [None]:
# Create an empty plot.
fig, ax = plt.subplots(figsize=(12, 3))

# Plot a bar chart.
ax.bar(x, counts);

In [None]:
np.random.randint(1,192)  # upper bound is exclusive, so 192 gives values 1..191

In [None]:
L = [np.random.randint(1,192) for i in range(191)]
L

In [None]:
plt.hist(L)
plt.show()

In [None]:
mean = df.AVG.mean()
mean

In [None]:
std = df.AVG.std()
std

In [None]:
size = len(df.index)
size

In [None]:
# Synthesize a random normal distribution for Average Score
norm_y = np.random.normal(mean,std,size)
norm_y

In [None]:
df = df.rename(columns={'TOTAL STROKES': 'STROKES', 'TOTAL ROUNDS': 'ROUNDS'})
df

In [None]:
# Total Strokes appears to follow a Normal Distribution
# Plot Total Strokes data from pgatour_2022-2023_actualaveragescore.csv (loaded above)
STROKES = df.STROKES
plt.hist(STROKES)
plt.show()

In [None]:
mean_strokes = df.STROKES.mean()
mean_strokes

In [None]:
std_strokes = df.STROKES.std()
std_strokes

In [None]:
# Synthesize a random normal distribution for STROKES
norm_strokes = np.random.normal(mean_strokes,std_strokes,size)
norm_strokes

In [None]:
# ROUNDS appears to follow a Normal Distribution
# Plot ROUNDS data from pgatour_2022-2023_actualaveragescore.csv (loaded above)
ROUNDS = df.ROUNDS
plt.hist(ROUNDS)
plt.show()

In [None]:
mean_rounds = df.ROUNDS.mean()
mean_rounds

In [None]:
std_rounds = df.ROUNDS.std()
std_rounds

In [None]:
# Synthesize a random normal distribution for ROUNDS
norm_rounds = np.random.normal(mean_rounds,std_rounds,size)
norm_rounds

##  DATA FRAME OF SYNTHESIZED DATA

In [None]:
# One column per variable; each normal column is drawn independently of the others
syn_df = pd.DataFrame({
    'RANK': np.arange(1, 192, 1),
    'AVERAGE': np.random.normal(mean, std, size),
    'STROKES': np.random.normal(mean_strokes, std_strokes, size),
    'ROUNDS': np.random.normal(mean_rounds, std_rounds, size),
})
syn_df
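
In the data frame above, RANK and AVERAGE are generated independently, so the synthetic data carries no relationship between them. One possible refinement is sketched below, assuming rank 1 corresponds to the lowest average score and using rounded stand-in values (70.30 and 0.64) for the mean and standard deviation. The name `linked_df` is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)  # arbitrary seed for reproducibility
size = 191
mean, std = 70.30, 0.64              # rounded stand-ins for df.AVG.mean() and df.AVG.std()

# Sort the simulated averages so the lowest score receives rank 1.
average = np.sort(rng.normal(mean, std, size)).round(2)
linked_df = pd.DataFrame({"RANK": np.arange(1, size + 1), "AVERAGE": average})

# Correlation between rank and average score in the linked data set.
print(np.corrcoef(linked_df.RANK, linked_df.AVERAGE)[0, 1])
```

Sorting changes only the order of the simulated scores, so the marginal distribution of AVERAGE stays normal while the rank ordering becomes consistent with the scores.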

## DATA TYPES

In [None]:
syn_df.dtypes

In [None]:
#https://stackoverflow.com/questions/66969078/set-decimal-precision-of-a-pandas-dataframe-column-with-a-datatype-of-decimal
syn_df.AVERAGE = syn_df.AVERAGE.round(2)
syn_df.AVERAGE

**Integers**

$\mathbb{Z} = \{ \ldots, -3, -2, -1, 0, 1, 2, 3, \ldots \}$

**Naturals**

$\mathbb{N} = \{1, 2, 3, \ldots\}$

$\mathbb{N}_0 = \{0, 1, 2, 3, \ldots\}$

**Reals**

$ \mathbb{R} $

![Real Number Line](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Real_number_line.svg/689px-Real_number_line.svg.png)

In [None]:
syn_df.STROKES = syn_df.STROKES.astype(int)
syn_df.ROUNDS = syn_df.ROUNDS.astype(int)
syn_df
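
Note that `astype(int)` truncates toward zero rather than rounding to the nearest integer. A small sketch with made-up values (not taken from the data set) shows the difference:

```python
import pandas as pd

values = pd.Series([5123.9, 72.6, 80.2])  # made-up values for illustration

truncated = values.astype(int)            # drops the fractional part
rounded = values.round().astype(int)      # rounds first, then converts

print(truncated.tolist())  # [5123, 72, 80]
print(rounded.tolist())    # [5124, 73, 80]
```

Rounding before conversion avoids a systematic downward shift in the simulated counts.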

In [None]:
syn_df["TOUR"] = "PGA TOUR"
syn_df

In [None]:
#syn_df = syn_df.sort_values(['Average'], ascending=False)
#syn_df

In [None]:
#df = df.rename(columns={'TOTAL STROKES': 'STROKES', 'oldName2': 'newName2'})
#strokes = df.TOTAL STROKES
#plt.hist(strokes)
#plt.show()

In [None]:
plt.plot(df.AVG)
plt.show()

In [None]:
plt.scatter(RANK,AVG)
plt.show()

In [None]:
RANK = np.array(RANK)
RANK

In [None]:
AVG = np.array(AVG)
AVG

## BEST-FIT DISTRIBUTION

In [None]:
from scipy import stats
dist = stats.norm
data = AVG
bounds = [(68, 73), (0, 191)]
res = stats.fit(dist, data, bounds)
res

In [None]:
res.params

In [None]:
res.plot()
plt.show()
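
As a cross-check on the bounded fit, `scipy.stats.norm.fit` returns the maximum-likelihood estimates directly, which for a normal distribution are the sample mean and the population (ddof=0) standard deviation. This sketch uses a simulated stand-in for `AVG` with assumed parameters, not the real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)   # arbitrary seed for reproducibility
data = rng.normal(70.30, 0.64, 191)   # simulated stand-in for AVG (assumed parameters)

# Maximum-likelihood estimates of loc (mean) and scale (standard deviation).
loc, scale = stats.norm.fit(data)

print(loc, data.mean())         # the two values agree
print(scale, data.std(ddof=0))  # the two values agree
```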

In [None]:
import seaborn as sns
from scipy.stats import norm

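# Synthetic sample from a normal distribution with hard-coded loc and scale
data = norm.rvs(70.30324607329841, 0.6373654137997954, size=191)  # a pandas Series or a list would also work

# sns.distplot is deprecated; histplot with a KDE overlay is the replacement
sns.histplot(data, kde=True)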
plt.show()

#  CORRELATION and LINEAR REGRESSION

In [None]:
correlation = np.corrcoef(RANK, AVG)
correlation

In [None]:
x = RANK.reshape(-1,1)
y = AVG.reshape(-1,1)

In [None]:
lr = LinearRegression()
lr.fit(x,y)

In [None]:
lr.predict([[5]])

In [None]:
pred_y = lr.predict(x)
pred_y

In [None]:
plt.plot(x,pred_y )
plt.show()

In [None]:
colour="red"
plt.scatter(x,y)
plt.plot(x,pred_y,colour )
plt.show()
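
The fitted line can be summarised by its slope, intercept and R². The sketch below uses `np.polyfit` instead of `LinearRegression`, on simulated stand-in data rather than the real `RANK` and `AVG`. It also illustrates that, for a single predictor, R² equals the square of the Pearson correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(seed=5)                 # arbitrary seed for reproducibility
rank = np.arange(1, 192)
avg = np.sort(rng.normal(70.30, 0.64, rank.size))   # simulated stand-in for AVG

# Least-squares straight line: avg ~ slope * rank + intercept.
slope, intercept = np.polyfit(rank, avg, deg=1)
pred = slope * rank + intercept

# Coefficient of determination computed from residuals.
ss_res = np.sum((avg - pred) ** 2)
ss_tot = np.sum((avg - avg.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(rank, avg)[0, 1]
print(slope, intercept, r_squared, r ** 2)
```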

In [None]:
df.AVG

In [None]:
df.describe()

In [None]:
count = df.count()
count

In [None]:
# Plot the observed Average Score values (y holds AVG from the regression cells above)
plt.hist(y)
plt.show()

In [None]:
x = np.array(x)
x

In [None]:
y = np.array(y)
y

In [None]:
#from scipy import stats
#dist = stats.norm
#data = y
#res = stats.fit(dist, data,bounds=[(69,73), (0, 50)])
#res

In [None]:
#res.params

In [None]:
#res.plot()
#plt.show()

In [None]:
#data = y

#sns.distplot(data)
#plt.show()

In [None]:
#ks.test(df, "pnorm", mean=mean, sd=std)

In [None]:
#fitdist(y, "norm")

In [None]:
x = df.RANK
plt.scatter(x,y)
plt.show()

In [None]:
#correlation = np.corrcoef(x, y)
#correlation

In [None]:
#x = x.reshape(-1,1)
#y = y.reshape(-1,1)

In [None]:
#lr = LinearRegression()
#lr.fit(x,y)

In [None]:
#pred_y = lr.predict(x)
#pred_y

In [None]:
#plt.plot(x,pred_y )
#plt.show()

In [None]:
#colour="red"
#plt.scatter(x,y)
#plt.plot(x,pred_y,colour )
#plt.show()

In [None]:
# Synthesize a poisson distribution for Average Score
p = np.random.default_rng().poisson(mean, 191)
p

In [None]:
# Plot poisson distribution for Average Score
plt.hist(p)
plt.show()

In [None]:
plt.scatter(x,p)
plt.show()
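
A Poisson distribution has variance equal to its mean, so with a mean near 70 its standard deviation is about 8.4, far wider than the spread of roughly 0.64 seen in the average scores. It also produces only integers. The comparison below is a sketch using rounded stand-in parameters, not the real data.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed for reproducibility
mean_score = 70.30                   # rounded stand-in for df.AVG.mean()
observed_std = 0.64                  # rounded stand-in for df.AVG.std()

poisson_scores = rng.poisson(lam=mean_score, size=191)
normal_scores = rng.normal(loc=mean_score, scale=observed_std, size=191)

print(np.sqrt(mean_score))         # theoretical Poisson standard deviation, about 8.38
print(poisson_scores.std(ddof=1))  # sample standard deviation of the Poisson draws
print(normal_scores.std(ddof=1))   # sample standard deviation of the normal draws
```

On these assumptions the normal distribution is the closer match for Average Score.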

## SEABORN PAIRPLOTS

In [None]:
sns.pairplot(data=df)

In [None]:
#correlation = np.corrcoef(x, p)
#correlation

In [None]:
# https://www.pgatour.com/stats/detail/101
df1 = pd.read_csv('pgatour_golfstats_2022-2023_driving.csv')
df1

In [None]:
df1 = df1.drop('MOVEMENT', axis=1)
df1 = df1.drop('PLAYER_ID', axis=1)

In [None]:
df1["TOUR"] = "PGA TOUR"
df1

In [None]:
x = df1.RANK
y = df1.AVG
# Line plot of driving distance against rank
fig, ax = plt.subplots(1, 1) 
ax.plot(x,y) 
  
# Set title 
ax.set_title("Longest Drivers on PGA TOUR") 
  
# adding labels 
ax.set_xlabel('RANK') 
ax.set_ylabel('Driving Distance') 

plt.show()

In [None]:
plt.scatter(x,y)
plt.show()

In [None]:
correlation = np.corrcoef(x, y)
correlation

In [None]:
# Plot Driving Distance data from pgatour_golfstats_2022-2023_driving.csv
# Creating histogram 
fig, ax = plt.subplots(1, 1) 
ax.hist(y) 
  
# Set title 
ax.set_title("PGA TOUR") 
  
# adding labels 
ax.set_xlabel('Driving Distance') 
ax.set_ylabel('Number of Players') 

plt.show()

In [None]:
sns.pairplot(data=df1)

In [None]:
df1.AVG

In [None]:
std = df1.AVG.std()
std

In [None]:
mean = df1.AVG.mean()
mean

In [None]:
df1.describe()

In [None]:
count = df1.count()
count

In [None]:
# Synthesize a random normal distribution for Driving Distance
# I chose mu of 300.96 because this is the average driving distance in pgatour_golfstats_2022-2023_driving.csv; sigma of 8.66 is the matching standard deviation
mu, sigma = 300.96, 8.66 # mean and standard deviation
s = np.random.default_rng().normal(mu, sigma, 189)
s

In [None]:
x, counts = np.unique(s, return_counts=True)
x


In [None]:
counts

In [None]:
# Plot random normal distribution for Driving Distance
# Creating histogram 
fig, ax = plt.subplots(1, 1) 
ax.hist(s) 
  
# Set title 
ax.set_title("Normal Distribution") 
  
# adding labels 
ax.set_xlabel('Driving Distance') 
ax.set_ylabel('Number of Players') 

plt.show()

In [None]:
plt.plot(np.random.default_rng().normal(mu, sigma, 189))
plt.show()

In [None]:
colour="red"
# Evaluate the normal PDF from 260 to 339 in steps of 1.
x_axis = np.arange(260, 340, 1)
# Mean and SD are hard-coded to the driving distance values used above.
plt.plot(x_axis, norm.pdf(x_axis,300.96402116402106,8.66203975287581),colour)
plt.show()

In [None]:
#def my_gauss(x, sigma=sigma, h=h, mean=mean):
#    from math import exp, pow
#    variance = pow(sigma, 2)
#    return h * exp(-pow(x-mean, 2)/(2*variance))

In [None]:
#h = 40
#mean = mean
#variance = pow(sigma, 2)
#sigma = math.sqrt(variance)
#x = np.linspace(mu - 3*sigma, mu + 3*sigma, 1)
#plt.plot(x, scipy.stats.norm.pdf(x, mu, sigma))
#plt.show()
#my_gauss

In [None]:
value = np.random.normal(loc=300.96402116402106,scale=8.66203975287581,size=189)
sns.displot(value)
sns.lineplot(value)

In [None]:
# Synthesize a poisson distribution for Driving Distance
mu=300.96402116402106
p = np.random.default_rng().poisson(mu, len(df1.index))
p

In [None]:
# Plot poisson distribution for Driving Distance

# Creating histogram 
fig, ax = plt.subplots(1, 1) 
ax.hist(p) 
  
# Set title 
ax.set_title("Poisson Distribution") 
  
# adding labels 
ax.set_xlabel('Driving Distance') 
ax.set_ylabel('Number of Players') 

plt.show()

In [None]:
dataframe1 = pd.read_excel('dpworldtour_2022-2023_scoringaverage.xlsx')
dataframe1

In [None]:
dataframe1 = dataframe1.drop('COUNTRY',axis=1)

In [None]:
dataframe1

In [None]:
print(dataframe1.columns)

In [None]:
#dataframe1.columns = ['RANK', 'Unnamed: 1', 'PLAYER', 'ROUNDS', 'AVG', 'TOUR']
dataframe1.columns

In [None]:
dataframe1.iloc[0]

In [None]:
concat = pd.concat([df, dataframe1], ignore_index=True)
concat

In [None]:
pgatour = concat[concat['TOUR']=='PGA TOUR']
dpworldtour = concat[concat['TOUR']=='DP World Tour']

In [None]:
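pgatour.mean(numeric_only=True)  # skip text columns such as PLAYER and TOUR

In [None]:
dpworldtour.mean(numeric_only=True)  # skip text columns such as PLAYER and TOUR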

In [None]:
mean_groupby = concat.groupby(['TOUR']).AVG.mean()
mean_groupby

In [None]:
sns.pairplot(data = concat)
sns.pairplot(concat, hue="TOUR", palette="rainbow")

## TIME SERIES

In [None]:
dataframe2 = pd.read_excel('Shane Lowry_2022-2-23_Results_time-series.xlsx')
dataframe2

In [None]:
dataframe2['datetime'] = pd.to_datetime(dataframe2['DATE'])
dataframe2

In [None]:
dataframe2['datetime']

In [None]:
import seaborn as sns
sns.lineplot(data=dataframe2, x="datetime", y="WINNINGS")

In [None]:
dti = pd.date_range("2022-10-23", periods=52, freq="W")
len(dti)
dti

In [None]:
winnings = dataframe2['WINNINGS']
winnings

In [None]:
dataframe2['WINNINGS'] = dataframe2['WINNINGS'].replace({r'\$':''}, regex = True)
dataframe2['WINNINGS'] = dataframe2['WINNINGS'].replace({r'\,':''}, regex = True)
dataframe2['WINNINGS'] = dataframe2['WINNINGS'].replace({r'\-':'0'}, regex = True)
winnings = dataframe2['WINNINGS']
winnings

In [None]:
dataframe2['WINNINGS'] = pd.to_numeric(dataframe2['WINNINGS'])
dataframe2['WINNINGS']
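
The cleaning steps above can also be written as one chain. This sketch uses made-up strings and assumes that a lone "-" marks an event with no winnings. Unlike the regex version, it only replaces values that are exactly "-".

```python
import pandas as pd

winnings = pd.Series(["$1,234,567", "-", "$45,000.50"])  # made-up example values

cleaned = (
    winnings.str.replace(r"[$,]", "", regex=True)  # strip dollar signs and thousands separators
    .replace("-", "0")                             # a lone "-" is treated as zero winnings
)
numeric = pd.to_numeric(cleaned)

print(numeric.tolist())  # [1234567.0, 0.0, 45000.5]
```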

In [None]:
sns.lineplot(data=dataframe2, x="datetime", y="WINNINGS")

In [None]:
#sns.pairplot(data=dataframe2)

In [None]:
mean = dataframe2.WINNINGS.mean()
mean

In [None]:
sigma = dataframe2.WINNINGS.std()
sigma

In [None]:
rng = np.random.default_rng() 
d = rng.poisson(mean, len(dti))
df = pd.DataFrame(data=d, index = dti)
df.head()

In [None]:
df.columns =['WINNINGS']
df.head()

In [None]:
sns.lineplot(data=df, x=df.index, y="WINNINGS")

In [None]:
df.plot(y="WINNINGS")

In [None]:
d = rng.normal(mean,sigma,len(dti))
df = pd.DataFrame(data=d, index = dti)
df.head()

In [None]:
df.columns =['WINNINGS']
df.head()

In [None]:
df.plot(y="WINNINGS")

In [None]:
sns.lineplot(data=df, x=df.index, y="WINNINGS")
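
When the standard deviation is large relative to the mean, a normal distribution yields negative values, which cannot occur for prize money. One simple option, sketched here with purely hypothetical parameters (the real mean and sigma come from the cells above), is to clip the simulated values at zero.

```python
import numpy as np

rng = np.random.default_rng(seed=6)            # arbitrary seed for reproducibility
mean_w, sigma_w = 100_000.0, 150_000.0         # hypothetical values, not taken from the data
n_weeks = 52                                   # one value per week, as in dti

raw = rng.normal(mean_w, sigma_w, n_weeks)     # may contain negative values
clipped = np.clip(raw, 0, None)                # negative winnings become zero

print((raw < 0).sum(), (clipped < 0).sum())
```

Clipping shifts the mean upward slightly, so it is a rough fix rather than a faithful model of the winnings distribution.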

In [None]:
#sns.pairplot(data=df)

In [None]:
def my_gauss(x, sigma=sigma, h=450000, mean=mean):
    """Unnormalised Gaussian curve with peak height h, centred on mean."""
    variance = sigma ** 2
    return h * math.exp(-(x - mean) ** 2 / (2 * variance))

In [None]:
x=df.WINNINGS
yg = [my_gauss(xi) for xi in x]
yg

In [None]:
plt.hist(yg)

In [None]:
# Range of x values for plotting.
x = np.linspace(250, 350, 189)
x

In [None]:
#f(x)

In [None]:
# Create empty plot.
#fig, ax = plt.subplots(figsize=(12, 4))

# Plot f(x).
#ax.plot(x, f(x));

In [None]:
rng.integers(10,25)

In [None]:
rng.integers(100, size=10)

In [None]:
rng.integers(50,100, size=10)

In [None]:
rng.integers(100, size=(2,4))

In [None]:
rng.random()

In [None]:
rng.random(5)

In [None]:
rng.random((5,4))

In [None]:
x = rng.random(1000000)
x

In [None]:
import matplotlib.pyplot as plt
plt.hist(x)
plt.show()

In [None]:
rng = np.random.default_rng(seed=43)
x = rng.random()
print(x)

In [None]:
rng = np.random.default_rng(seed=44)
x = rng.integers(10)
print(x)

In [None]:
rng = np.random.default_rng(seed=43)
x = rng.random((2,4))
print(x)

In [None]:
rng = np.random.default_rng(seed=43)
x = rng.integers(50,100,size=(3,3))
print(x)
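
The seeded cells above rely on the fact that a generator created with the same seed produces the same sequence of numbers. A short sketch of that behaviour, using the same seeds 43 and 44:

```python
import numpy as np

a = np.random.default_rng(seed=43).random(3)
b = np.random.default_rng(seed=43).random(3)  # same seed as a
c = np.random.default_rng(seed=44).random(3)  # different seed

print(np.array_equal(a, b))  # True
print(np.array_equal(a, c))  # False
```

Fixing the seed in the synthesis cells would make the simulated data set reproducible between notebook runs.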

# RESULTS

# CONCLUSION

# REFERENCES / RESEARCH

Set decimal precision of a pandas dataframe column with a datatype of Decimal
https://stackoverflow.com/questions/66969078/set-decimal-precision-of-a-pandas-dataframe-column-with-a-datatype-of-decimal

Populate Pandas Dataframe with normal distribution
https://stackoverflow.com/questions/58996519/populate-pandas-dataframe-with-normal-distribution

How to create a DataFrame of random integers with Pandas?
https://stackoverflow.com/questions/32752292/how-to-create-a-dataframe-of-random-integers-with-pandas

Discrete uniform distribution
https://en.wikipedia.org/wiki/Discrete_uniform_distribution

Normal distribution
https://en.wikipedia.org/wiki/Normal_distribution

How to choose number of bins in numpy.histogram?
https://stackoverflow.com/questions/47607250/how-to-choose-number-of-bins-in-numpy-histogram

Count number of elements in a specific bin
https://stackoverflow.com/questions/55482071/count-number-of-elements-in-a-specific-bin

Exploratory data analysis in Python.
https://nbviewer.org/github/Tanu-N-Prabhu/Python/blob/master/Exploratory_data_Analysis.ipynb

How to add a new column to an existing DataFrame
https://stackoverflow.com/questions/12555323/how-to-add-a-new-column-to-an-existing-dataframe

GOLF STATISTICS
https://datagolf.com/

The synthetic data platform for developers.
https://gretel.ai/

Change the data type of a column or a Pandas Series
https://www.geeksforgeeks.org/change-the-data-type-of-a-column-or-a-pandas-series/

Fitting a Normal distribution to 1D data
https://stackoverflow.com/questions/20011122/fitting-a-normal-distribution-to-1d-data

SCIPY FIT FUNCTION
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fit.html#scipy.stats.fit

Fitting empirical distribution to theoretical ones with Scipy (Python)?
https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-ones-with-scipy-python

SK LEARN Linear Regression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

API for numpy.random.Generator.poisson distribution method
https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.poisson.html#numpy.random.Generator.poisson

API for pandas.DataFrame
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame

Pandas Time Series function
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

Remove Dollar Sign from Entire Python Pandas Dataframe
https://stackoverflow.com/questions/43096522/remove-dollar-sign-from-entire-python-pandas-dataframe

Merge, join, concatenate and compare Dataframes
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html