# Next Steps for Data Analysis

If you have gone through the notebooks in this series, you should understand the building blocks of data analysis in Python well enough to start trying them out on a real-world problem. In this notebook, we will briefly look at some next steps in data analysis.

## More Data! How to handle big datasets

Data science has developed greatly in the last 20 years and has become much more relevant to the daily activities of almost everyone, mainly because of the huge increase in the possibilities for collecting data in the digital age, with relatively cheap IT now widely available. With more data, more insight can be gained from analysis, but the challenges associated with processing it also increase greatly. It is very easy to find yourself needing to process a dataset that cannot easily be handled on a standard desktop or laptop computer. In this section, we will look at options for handling bigger datasets.

### Dask

https://dask.org/

In [3]:
import pathlib

In [2]:
import dask.dataframe

Dask uses interoperable interfaces, so code that works with pandas should be able to switch to Dask with minimal (if any) changes. For the XBT data from notebook 
https://docs.dask.org/en/latest/dataframe.html

In [49]:
xbt_datafile = pathlib.Path('/data','xbt-data','csv_dask_clean','xbt_1989.csv')

In [35]:
xbt_ddf = dask.dataframe.read_csv(xbt_datafile)
xbt_ddf

Dask DataFrame Structure (npartitions=1):

| | Unnamed: 0 | country | lat | lon | date | year | month | day | institute | platform | cruise_number | instrument | model | manufacturer | max_depth | imeta_applied | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | int64 | object | float64 | float64 | int64 | int64 | int64 | int64 | object | object | object | object | object | object | float64 | int64 | int64 |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


As you can see, this has not loaded any data, only the metadata. We can now set up a chain of computation, and then trigger the actual loading and computing at the end.

In [54]:
xbt_path_list = [p1 for p1 in pathlib.Path('/data','xbt-data','csv_clean_dask').iterdir() if 'xbt' in str(p1)]

In [55]:
xbt_ddf = dask.dataframe.concat( [dask.dataframe.read_csv(p1) for p1 in xbt_path_list], ignore_index=True)
xbt_ddf

Dask DataFrame Structure (npartitions=2):

| | Unnamed: 0 | country | lat | lon | date | year | month | day | institute | platform | cruise_number | instrument | model | manufacturer | max_depth | imeta_applied | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | int64 | object | float64 | float64 | int64 | int64 | int64 | int64 | object | object | object | object | object | object | float64 | int64 | int64 |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


In [56]:
xbt_t4_ddf = xbt_ddf[xbt_ddf.instrument == 'XBT: T4 (SIPPICAN)']
xbt_t4_ddf

Dask DataFrame Structure (npartitions=2):

| | Unnamed: 0 | country | lat | lon | date | year | month | day | institute | platform | cruise_number | instrument | model | manufacturer | max_depth | imeta_applied | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | int64 | object | float64 | float64 | int64 | int64 | int64 | int64 | object | object | object | object | object | object | float64 | int64 | int64 |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


In [57]:
t4_mean_depth = xbt_t4_ddf.max_depth.mean()
t4_mean_depth

dd.Scalar<series-..., dtype=float64>

In [58]:
t4_mean_depth.compute()

377.38020450256903
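Conceptually, a mean over a partitioned dataset can be obtained by reducing each partition to a (sum, count) pair and then combining the pairs. The sketch below illustrates that idea in plain Python; it is not Dask's actual implementation, and the partition values and function names are made up for illustration.

```python
# Two made-up partitions of a max_depth column.
partitions = [
    [100.0, 200.0, 300.0],
    [400.0, 500.0],
]


def partial_stats(values):
    """Reduce one partition to a (sum, count) pair."""
    return sum(values), len(values)


def combine(stats):
    """Combine per-partition (sum, count) pairs into a global mean."""
    total = sum(s for s, _ in stats)
    count = sum(n for _, n in stats)
    return total / count


# Each partial_stats call is independent, so it could run on a different worker.
partitioned_mean = combine([partial_stats(p) for p in partitions])
print(partitioned_mean)
```

Note that simply averaging the per-partition means would give the wrong answer when partitions differ in size, which is why the counts are carried along.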

The important thing about Dask is that we could set up a *Dask cluster* so that the computation happens on another machine, perhaps one with more memory, more CPUs, or faster access to the data. Using Dask (or an equivalent library) to distribute and manage your compute is a key component in scaling up your data analysis work for real-world data pipelines.

### Cloud computing

* Amazon Web Services (AWS) - https://aws.amazon.com/
* Microsoft Azure - https://azure.microsoft.com/en-gb/
* Google Cloud - https://cloud.google.com/
* Digital Ocean - https://www.digitalocean.com/


## Interactive visualisations

* Bokeh - https://docs.bokeh.org/en/latest/index.html
* Plotly - https://plotly.com/

## Machine learning

So far we have been doing fairly simple analysis operations, which just show how we can use the data frameworks. We can often gain a lot of insight from commonly used operations such as finding the mean, but deeper insight often comes from more complex statistical or machine learning model-based techniques. Once you have your data in a data-model based format, it is fairly straightforward to use it as input to common machine learning tools, such as *scikit-learn* or *tensorflow*.

* scikit-learn https://scikit-learn.org/stable/
* tensorflow https://www.tensorflow.org/

Here is a quick example of how the XBT dataset could be used as input to train a machine-learning model.

In [75]:
import numpy

In [132]:
import sklearn.tree
import sklearn.preprocessing
import sklearn.model_selection
import sklearn.metrics

In [62]:
xbt_df = xbt_ddf.compute()

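In [122]:
# One-hot encode the instrument labels, which are the prediction target.
# scikit-learn >= 1.2 names this argument `sparse_output`; older versions use `sparse`.
instr_encoder = sklearn.preprocessing.OneHotEncoder(sparse_output=False)
instr_array = instr_encoder.fit_transform(xbt_df.instrument.values.reshape(-1,1))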

In [99]:
country_encoder = sklearn.preprocessing.LabelEncoder()
country_array = country_encoder.fit_transform(xbt_df['country']).reshape(-1,1)

In [91]:
year_encoder = sklearn.preprocessing.OrdinalEncoder()
year_array = year_encoder.fit_transform(xbt_df['year'].values.reshape(-1,1))

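In [92]:
# Scale max_depth to the range [0, 1]. Normalizer rescales each row to unit norm,
# which would turn a single-column input into all ones.
max_depth_encoder = sklearn.preprocessing.MinMaxScaler()
max_depth_array = max_depth_encoder.fit_transform(xbt_df['max_depth'].values.reshape(-1,1))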
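When scaling a single numeric column, the choice of transformer matters: `Normalizer` works row by row, whereas `MinMaxScaler` (or `StandardScaler`) works column by column. The short comparison below uses made-up depth values to show the difference.

```python
import numpy
import sklearn.preprocessing

# Made-up depths in a single column, shaped (n_samples, 1).
depths = numpy.array([[100.0], [450.0], [760.0]])

# Normalizer rescales each ROW to unit norm, so a one-column input collapses to 1.0.
row_normalised = sklearn.preprocessing.Normalizer().fit_transform(depths)

# MinMaxScaler rescales each COLUMN to [0, 1], preserving relative differences.
column_scaled = sklearn.preprocessing.MinMaxScaler().fit_transform(depths)

print(row_normalised.ravel())
print(column_scaled.ravel())
```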

In [125]:
X = numpy.concatenate([country_array,year_array,max_depth_array], axis=1 )
y = instr_array

In [126]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y)

In [128]:
tree_clf = sklearn.tree.DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)

DecisionTreeClassifier()

In [130]:
y_out = tree_clf.predict(X_test)

In [134]:
sklearn.metrics.precision_recall_fscore_support(y_test, y_out, average='micro')

(0.7997581782763485, 0.7427323767935121, 0.770191156968658, None)
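The separate encoding steps can also be bundled with the classifier into a single scikit-learn `Pipeline`, so that the same preprocessing is applied consistently at training and prediction time. The sketch below is one possible arrangement, not the method used in the cells above. It runs on a synthetic table whose values are made up, and the second instrument label is included only to give the classifier two classes.

```python
import numpy
import pandas
import sklearn.compose
import sklearn.metrics
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.tree

# Synthetic stand-in for the XBT table; all values are made up for illustration.
rng = numpy.random.default_rng(0)
n = 400
instrument = rng.choice(["XBT: T4 (SIPPICAN)", "XBT: T7 (SIPPICAN)"], size=n)
toy_df = pandas.DataFrame({
    "country": rng.choice(["US", "JP", "FR"], size=n),
    "year": rng.integers(1985, 1995, size=n),
    "max_depth": numpy.where(instrument == "XBT: T4 (SIPPICAN)", 450.0, 760.0)
    + rng.normal(0, 20, size=n),
    "instrument": instrument,
})

# Encode the categorical column and scale the numeric columns in one step.
preprocess = sklearn.compose.ColumnTransformer([
    ("country", sklearn.preprocessing.OrdinalEncoder(), ["country"]),
    ("numeric", sklearn.preprocessing.StandardScaler(), ["year", "max_depth"]),
])
model = sklearn.pipeline.Pipeline([
    ("preprocess", preprocess),
    ("tree", sklearn.tree.DecisionTreeClassifier(random_state=0)),
])

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    toy_df[["country", "year", "max_depth"]], toy_df["instrument"], random_state=0
)
model.fit(X_train, y_train)
toy_accuracy = sklearn.metrics.accuracy_score(y_test, model.predict(X_test))
print(f"accuracy on synthetic data: {toy_accuracy:.2f}")
```

Because the synthetic depths are deliberately well separated, the score here says nothing about performance on the real XBT data.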

## Even less code - Software as a Service (SaaS)

* AWS SageMaker - https://aws.amazon.com/sagemaker/
* AzureML - https://studio.azureml.net/