This tutorial is for data exploration using Python. There are about 70 functions available
in daexp.py. In this tutorial I will provide a bunch of examples showing how to use the API.
Setup
=====
Make sure the ../lib and ../mlextra directories, with all the Python files, are present
relative to where your script is. You also need to have all the Python libraries mentioned
in the blog installed.
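Here is a minimal sketch of that path setup, assuming your script sits one level below
the lib and mlextra directories (adjust the relative paths to your checkout):
code:
import os
import sys

sys.path.append(os.path.abspath("../lib"))
sys.path.append(os.path.abspath("../mlextra"))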
Basic summary statistics
========================
Very basic stats
code:
import os
import sys
sys.path.append(os.path.abspath("../mlextra"))
from daexp import *
exp = DataExplorer()
exp.addFileNumericData("bord.txt", 0, 1, "pdemand", "demand")
exp.getStats("pdemand")
output:
== adding numeric columns from a file ==
done
== getting summary statistics for data sets pdemand ==
{ 'kurtosis': -0.12152386739702337,
'length': 1000,
'mad': 2575.2762,
'max': 18912,
'mean': 10920.908,
'median': 11011.5,
'min': 3521,
'mode': 10350,
'mode count': 3,
'n largest': [18912, 18894, 17977, 17811, 17805],
'n smallest': [3521, 3802, 4185, 4473, 4536],
'skew': -0.009681701835865877,
'std': 2569.1597609989144}
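If you want to sanity check a few of these numbers independently, here is a hedged sketch
using pandas and scipy directly (assuming bord.txt is comma separated, with the demand
values in column 1, as in the addFileNumericData call above):
code:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("bord.txt", header=None)
x = df[1].to_numpy()                      # column 1 holds the "demand" values
print(x.mean(), np.median(x), x.std())    # numpy std uses ddof=0, matching the output above
print(stats.skew(x), stats.kurtosis(x))   # scipy reports excess kurtosis, as above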
Check if data is Gaussian
=========================
We will use the Shapiro-Wilk test (there are a few others available) for the same data set
loaded in the previous example.
code:
exp.testNormalShapWilk("demand")
output:
== doing shapiro wilks normalcy test for data sets demand ==
result details:
{'pvalue': 0.02554553933441639, 'stat': 0.9965143203735352}
test result:
stat: 0.997
pvalue: 0.026
significance level: 0.050
probably not gaussian
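For reference, the same test can be run with scipy directly, as a sketch (assuming the
same comma separated bord.txt with the demand values in column 1):
code:
import numpy as np
from scipy import stats

x = np.loadtxt("bord.txt", delimiter=",", usecols=1)
stat, pvalue = stats.shapiro(x)
print(stat, pvalue)
if pvalue < 0.05:
    print("probably not gaussian")
else:
    print("probably gaussian")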
Find outliers in data
=====================
We will find outliers in the data, if any, using the isolation forest algorithm.
code:
exp.addFileNumericData("sale1.txt", 0, "sale")
exp.getOutliersWithIsoForest(.002, "sale")
output:
== adding numeric columns from a file ==
done
== getting outliers using isolation forest for data sets sale ==
result details:
{ 'dataWithoutOutliers': array([[1006],
[1076],
[1107],
[1066],
[ 954],
.......
[1044],
[ 939],
[ 876]]),
'numOutliers': 2,
'outliers': array([[5000],
[ 832]])}
We found 2 outliers.
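The same idea with scikit-learn directly, as a hedged sketch (the 0.002 contamination
mirrors the argument passed to getOutliersWithIsoForest; sale1.txt is assumed to hold one
sale value per line, as above):
code:
import numpy as np
from sklearn.ensemble import IsolationForest

x = np.loadtxt("sale1.txt", usecols=0)
X = x.reshape(-1, 1)                  # scikit-learn expects a 2-D feature matrix
labels = IsolationForest(contamination=0.002).fit_predict(X)
print(X[labels == -1])                # label -1 marks the outliers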
Find auto correlation peaks
===========================
We are going to find the secondary auto correlation peak, which will tell us the length of
the seasonal cycle, if any.
code:
exp.addFileNumericData("sale.txt", 0, "sale")
exp.getAutoCorr("sale", 20)
output:
== adding numeric columns from a file ==
done
== getting auto correlation for data sets sale ==
result details:
{ 'autoCorr': array([ 1. , 0.5738174 , -0.20129608, -0.82667856, -0.82392299,
-0.20331679, 0.56991343, 0.91427488, 0.5679168 , -0.20108015,
-0.81710428, -0.8175842 , -0.20391004, 0.56864915, 0.90936982,
0.56528676, -0.20657182, -0.81111562, -0.81204275, -0.1970099 ,
0.56175539]),
'confIntv': array([[ 1. , 1. ],
[ 0.5118379 , 0.6357969 ],
[-0.28111578, -0.12147637],
[-0.90842511, -0.74493201],
[-0.93316119, -0.71468479],
[-0.33426918, -0.07236441],
[ 0.43775398, 0.70207288],
[ 0.77298956, 1.0555602 ],
[ 0.40548625, 0.73034734],
[-0.37096731, -0.03119298],
[-0.98790327, -0.64630529],
[-1.00279183, -0.63237657],
[-0.40249873, -0.00532136],
[ 0.36925779, 0.76804052],
[ 0.70384298, 1.11489665],
[ 0.34484471, 0.7857288 ],
[-0.43251377, 0.01937013],
[-1.03778192, -0.58444933],
[-1.04959751, -0.57448798],
[-0.44499878, 0.05097898],
[ 0.313166 , 0.81034477]])}
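You can reproduce this with statsmodels directly and locate the secondary peak, as a
sketch (alpha=0.05 requests the confidence intervals shown above):
code:
import numpy as np
from statsmodels.tsa.stattools import acf

x = np.loadtxt("sale.txt", usecols=0)
autoCorr, confIntv = acf(x, nlags=20, alpha=0.05)
peak = np.argmax(autoCorr[1:]) + 1        # largest autocorrelation after lag 0
print("secondary peak at lag", peak)      # 7 for this data set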
Saving a note on a finding and saving the workspace
===================================================
In the previous test we found an auto correlation peak at 7. We are going to save this in a
note and then save the workspace.
code:
exp.addNote("sale", "auto correlation peak found at 7")
exp.save("./model/daexp/exp.mod")
output:
== adding note ==
done
== saving workspace ==
done
Restore workspace and extract time series components
=====================================================
We are going to restore the workspace, look at our notes and then extract the time series
components.
code:
exp.restore("./model/daexp/exp.mod")
exp.getNotes("sale")
exp.getTimeSeriesComponents("sale","additive", 7, True, False)
output:
== restring workspace ==
done
== getting notes ==
auto correlation peak found at 7
== extracting trend, cycle and residue components of time series for data sets sale ==
result details:
{ 'residueMean': 0.022420235699977295,
'residueStdDev': 19.14825253159541,
'seasonalAmp': 98.22786720321932,
'trendMean': 1004.9323081345215,
'trendSlope': -0.0048913825348870996}
Notice we didn't call any add data API. We are using the data sets saved in the previous
session and restored in the current session. But you could add additional data sets.
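A comparable decomposition with statsmodels directly, as a sketch (period=7 matches the
auto correlation peak noted earlier; older statsmodels versions use freq instead of
period):
code:
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

x = np.loadtxt("sale.txt", usecols=0)
result = seasonal_decompose(x, model="additive", period=7)
print(result.trend, result.seasonal, result.resid)   # trend and resid have NaN at the edges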
Find out if data is stationary
==============================
We are using the same restored workspace. The data has trend and seasonality and hence is
not stationary. The following test confirms that: the pvalue is less than the significance
level, rejecting the null hypothesis of stationarity.
code:
exp.testStationaryKpss("sale", "c", None)
output:
== doing KPSS stationary test for data sets sale ==
/usr/local/lib/python3.7/site-packages/statsmodels/tsa/stattools.py:1685: FutureWarning:
The behavior of using lags=None will change in the next release. Currently lags=None is the
same as lags='legacy', and so a sample-size lag length is used. After the next release,
the default will change to be the same as lags='auto' which uses an automatic lag length
selection method. To silence this warning, either use 'auto' or 'legacy'
warn(msg, FutureWarning)
result details:
{ 'critial values': {'1%': 0.739, '10%': 0.347, '2.5%': 0.574, '5%': 0.463},
'num lags': 22,
'pvalue': 0.04146105558567144,
'stat': 0.5009129131996188}
test result:
stat: 0.501
pvalue: 0.041
significance level: 0.050
probably not stationary
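The same test with statsmodels directly, as a sketch (regression="c" tests stationarity
around a constant; nlags="auto" silences the FutureWarning above, though older versions
use the lags keyword instead):
code:
import numpy as np
from statsmodels.tsa.stattools import kpss

x = np.loadtxt("sale.txt", usecols=0)
stat, pvalue, nlags, crit = kpss(x, regression="c", nlags="auto")
print(stat, pvalue)    # pvalue below the 0.05 level rejects the null of stationarity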
Find out if 2 samples are from the same distribution
====================================================
This is something you may want to know if your deployed machine learning model needs
retraining because the data has drifted. We are using the Kolmogorov-Smirnov test. There
are a few other options available in the API.
code:
exp.addFileNumericData("hsale.txt", 0, "hsale")
exp.addFileNumericData("sale.txt", 0, "sale")
exp.testTwoSampleKs("hsale", "sale")
output:
== adding numeric columns from a file ==
done
== adding numeric columns from a file ==
done
== doing Kolmogorov Sminov 2 sample test for data sets hsale sale ==
result details:
{'pvalue': 0.0, 'stat': 0.836}
test result:
stat: 0.836
pvalue: 0.000
significance level: 0.050
probably not same distribution
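The same comparison with scipy directly, as a sketch (both files assumed to hold one
value per line, as above):
code:
import numpy as np
from scipy import stats

hsale = np.loadtxt("hsale.txt", usecols=0)
sale = np.loadtxt("sale.txt", usecols=0)
stat, pvalue = stats.ks_2samp(hsale, sale)
print(stat, pvalue)
if pvalue < 0.05:
    print("probably not same distribution")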
I have provided a few sample use cases for the API. Create your own data exploration story
and feel free to play around and learn about your data.