# Programming for Data Science

All of the following observations are solely Joe Crumpton's opinions.


## Education

* BS Computer Engineering, MSU
* MS Computer Science, UTK
* PhD Computer Science, MSU

## Software Developer
* BASIC, Pascal, C, C++, Ada, Fortran (Education)
* Smalltalk (Courier Route Planner) and Java (PowerPad Prototype), FedEx
* Python GUI for Netmapper, Distributed Analytics and Security Institute, HPCC, MSU
* GIS Developer, Geosystems Research Institute, HPCC, MSU

## Teacher
* Taught High School Math and IT
* Instructor and Asst Clinical Professor, CSE Department, MSU
* Currently an Instructor in the Data Science Academic Institute
    * Mostly courses in the Data Science Pedagogy Certificate program
    * Teaching Applied Programming for Data Science in the MADS program

## What about you?

* Programming Language Poll  
  https://cloud.crumpton.dev/apps/polls/s/qAFBUwfh  
    
  <img src="https://raw.githubusercontent.com/jcrumpton/DSCI_6113_Data/main/DS_Club/Programming%20Language%20Poll.png">
  
  


## Computer Programming

* True in 1987 and true today...
  * All data is 1s and 0s
  * CPUs (and now GPUs) are really, really fast calculators
  * Any computation can be expressed in terms of statement sequences, conditionals, and iteration

* What has changed... 
  * More programming language paradigms widely used: imperative (procedural, object-oriented), functional, logic, ...
  * More programming languages: Go, Rust, F#, Scheme, Kotlin, Scala, ...
  * More computing platforms: WWW, clusters (CPUs and GPUs), mobile devices, embedded devices, robots, ... 

## Learning Computer Programming - Past 

* Mostly in industry or in college computer science courses

* Concerned with basic operations
  * sequences of single statements (`=`, `+`, `-`, `*`, `/`, `%`)
  * conditional statements (`if`..`then`..`else`, `switch`)
  * iteration (`for` loops, `while` loops, recursion)
  * functions (repeating code with different inputs)
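
As a minimal sketch, the four basic operations listed above can all appear in one small function. The name `count_evens` and the sample inputs are made up for illustration.

```python
def count_evens(numbers):
    """Count the even values in a list using only basic operations."""
    count = 0                      # sequence: a simple assignment
    for n in numbers:              # iteration: visit each item once
        if n % 2 == 0:             # conditional: test the remainder
            count = count + 1
    return count


# Function: the same code is reused with different inputs
small_count = count_evens([1, 2, 3, 4])
large_count = count_evens(list(range(100)))
print(small_count, large_count)
```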


## Learning Computer Programming - Past

* Built required code from the basics
  * algorithms (search, sort, graph, etc.)
  * data structures (linked lists, sets, hash tables, trees)


* Primarily text based
* Computed theoretical code speed and memory usage (Big _O_ notation)
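
One way to see what Big _O_ notation predicts is to build two searches from the basics and count their steps. The sketch below is illustrative only: the function names and the 1024-item list are assumptions, and "steps" here means loop iterations rather than measured time.

```python
def linear_search(items, target):
    """Return (index, steps) scanning left to right; index is -1 if absent."""
    steps = 0
    for index, value in enumerate(items):
        steps += 1
        if value == target:
            return index, steps
    return -1, steps


def binary_search(sorted_items, target):
    """Return (index, steps) for a sorted list; index is -1 if absent."""
    low, high = 0, len(sorted_items) - 1
    steps = 0
    while low <= high:
        mid = (low + high) // 2    # probe the middle of the remaining range
        steps += 1
        if sorted_items[mid] == target:
            return mid, steps
        if sorted_items[mid] < target:
            low = mid + 1          # discard the lower half
        else:
            high = mid - 1         # discard the upper half
    return -1, steps


data = list(range(1024))           # already sorted
linear_result = linear_search(data, 1023)
binary_result = binary_search(data, 1023)
print(linear_result, binary_result)
```

Searching for the last item takes one step per element with the linear search, $O(n)$, while the binary search needs only about $\log_2 n$ steps, $O(\log n)$.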



## Learning Computer Programming - Present

* Occurs at all educational levels and in many college departments (CSE, ECE, BIS, ITS, etc.)
* Still concerned with basic operations as a way to teach problem solving / decomposition
* __Use code libraries for algorithms and data structures__ *
* Use visualizations (GUIs, Web browsers, etc.) in addition to text
* Time / profile code to measure code speed and memory usage *

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;* computer science courses may be exceptions
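
Outside a notebook, both measurements can be taken with the standard library. This is a rough sketch, not a benchmark: `build_squares` and the size 100,000 are made-up stand-ins for the code being measured.

```python
import time
import tracemalloc


def build_squares(n):
    """Build a list of the first n squares (the work being measured)."""
    return [i * i for i in range(n)]


tracemalloc.start()                      # begin tracking Python memory allocations
start = time.perf_counter()              # high-resolution timer

squares = build_squares(100_000)

elapsed = time.perf_counter() - start
current, peak = tracemalloc.get_traced_memory()   # both values are in bytes
tracemalloc.stop()

print(f"elapsed: {elapsed:.4f} s  peak memory: {peak / 1024:.0f} KiB")
```

Memory tracing adds overhead, so for careful work it is probably better to measure time and memory in separate runs.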

## Code Comparison Prerequisites

In [None]:
# A list and a set of random numbers will be searched in later cell blocks

import random  

# Create a list of 1000 unique random numbers (random.sample draws without replacement)
a_list = random.sample(range(10000), 1000)
print(f"List Length: {len(a_list)}  Sum: {sum(a_list)}  {a_list[:5]}")

# Create a set from the list
a_set = set(a_list)
print(f"Set Length:  {len(a_set)}  Sum: {sum(a_set)}")


## Search from Basics

Sequences of Statements, Conditionals, Iteration

In [None]:
%%timeit -r 7 -n 1000
number_search_items = 100

answers = []
for target in range(number_search_items):
    for each in a_list:
        if each == target:
            answers.append(target)
            break

# print(answers)

## Search with some Python Knowledge

Use the `in` operator

In [None]:
%%timeit -r 7 -n 1000
number_search_items = 100

answers = []
for target in range(number_search_items):
    if target in a_list:
        answers.append(target)

# print(answers)

## Pythonic List Search

List Comprehension

In [None]:
%%timeit -r 7 -n 1000
number_search_items = 100

answers = [target for target in range(number_search_items) if target in a_list]

# print(answers)

## Which do you prefer?

* Search Code Preference Poll  
  https://cloud.crumpton.dev/apps/polls/s/UhrlGKH2  
    
  <img src="https://raw.githubusercontent.com/jcrumpton/DSCI_6113_Data/main/DS_Club/Search%20Code%20Preference.png">

## Conclusions

* Basic Operations
  * Basic code is more understandable (maybe?)
  * No large difference in performance
* Pythonic Code
  * Embedded control flow
  * Python programmers prefer Pythonic code  
    https://docs.python-guide.org/writing/style/

## Understanding Strengths of Different Data Structures

`set` is built on top of a hash table  
$O(1)$ searches on average

In [None]:
%%timeit -r 7 -n 1000
number_search_items = 100

answers = []
for target in range(number_search_items):
    if target in a_set:
        answers.append(target)

# print(answers)

## Using a Set with List Comprehension

In [None]:
%%timeit -r 7 -n 1000
number_search_items = 100

answers = [target for target in range(number_search_items) if target in a_set]

# print(answers)

## Comparison

In [None]:
import matplotlib.pyplot as plt
 
# width of the bars
barWidth = 0.3

sizes = [100, 200, 400]
list_times = [501, 996, 2002]  # microseconds
set_times = [4.2, 7.26, 11.2]  # microseconds
 
# The x position of bars
list_bars = range(len(sizes))
set_bars = [x + barWidth for x in list_bars]

plt.bar(list_bars, list_times, width=barWidth, color='blue', edgecolor='black', label='list')
plt.bar(set_bars, set_times, width=barWidth, color='cyan', edgecolor='black', label='set')

# general layout
plt.xticks([r + (barWidth / 2) for r in range(len(sizes))], sizes)
plt.xlabel('number of items searched for')
plt.ylabel('microseconds')
plt.title('Search Times for Lists and Sets')
plt.legend()

plt.show()
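
The times plotted above are hard-coded. As a hedged sketch of how comparable numbers might be collected outside of `%%timeit`, the standard `timeit` module can time both containers at each size. The helper `search`, the fixed seed, and the `repeats` value are assumptions, and the block rebuilds its own `a_list` and `a_set` so it runs on its own. Results will differ from machine to machine.

```python
import random
import timeit

random.seed(0)                                   # fixed seed so runs are repeatable
a_list = random.sample(range(10000), 1000)
a_set = set(a_list)


def search(container, number_search_items):
    """Return the targets in range(number_search_items) found in container."""
    return [t for t in range(number_search_items) if t in container]


sizes = [100, 200, 400]
repeats = 100                                    # runs per measurement (assumed value)
list_times = []
set_times = []
for size in sizes:
    # timeit returns total seconds for `number` runs; convert to microseconds per run
    list_seconds = timeit.timeit(lambda: search(a_list, size), number=repeats)
    set_seconds = timeit.timeit(lambda: search(a_set, size), number=repeats)
    list_times.append(list_seconds / repeats * 1e6)
    set_times.append(set_seconds / repeats * 1e6)

for size, lt, st in zip(sizes, list_times, set_times):
    print(f"{size:>4} items  list: {lt:8.1f} us  set: {st:6.1f} us")
```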

## Implications for Programming

* Write code that works first, then optimize if needed
  * Don't guess, time / profile code
  * Don't have to know why something is fast
  * Eventually make sure that code is using all cores (CPUs/GPUs)
* Learn Strengths and Weaknesses of 
  * Data Structures (Lists, Sets, Dictionaries)
  * Packages (numpy, pandas, matplotlib, seaborn, etc.)
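
To illustrate "don't guess, profile", here is a minimal sketch using the standard library's `cProfile`. The functions `slow_search`, `fast_search`, and `workload` are hypothetical examples written for this sketch; the report should show where the time goes without any need to reason about why.

```python
import cProfile
import io
import pstats
import random


def slow_search(items, targets):
    """Membership tests against a list (each test scans the list)."""
    return [t for t in targets if t in items]


def fast_search(items, targets):
    """Same result, but the list is converted to a set first."""
    lookup = set(items)
    return [t for t in targets if t in lookup]


def workload():
    """Run both searches on the same data so the profiler can compare them."""
    items = random.sample(range(10000), 1000)
    targets = range(400)
    return slow_search(items, targets), fast_search(items, targets)


profiler = cProfile.Profile()
profiler.enable()
slow_result, fast_result = workload()
profiler.disable()

# Collect the report in a string, sorted by cumulative time
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```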


## Another Example

Summary Statistics

In [None]:
# Create a list of 10000 random numbers
import random
a_list = random.sample(range(100000), 10000)

import statistics
list_avg = statistics.mean(a_list)
list_median = statistics.median(a_list)
list_sum = sum(a_list)
print(f"List - {list_avg=} {list_median=} {list_sum=}")

In [None]:
import numpy as np

an_array = np.array(a_list)

array_avg = np.mean(an_array)
array_median = np.median(an_array)
array_sum = np.sum(an_array)
print(f"Array - {array_avg=} {array_median=} {array_sum=}")

In [None]:
%%timeit -r 7 -n 100

list_avg = statistics.mean(a_list)
list_median = statistics.median(a_list)
list_sum = sum(a_list)


In [None]:
%%timeit -r 7 -n 100

array_avg = np.mean(an_array)
array_median = np.median(an_array)
array_sum = np.sum(an_array)

## Learning Python

* Basics
  * https://www.w3schools.com/python/
  * https://learn-python.adamemery.dev/
  * https://replit.com/@BrianTheado/python-koans
  * https://www.freecodecamp.org/news/learn-python-basics/

* NumPy
  * https://www.w3schools.com/python/numpy/
  * https://numpy.org/doc/stable/user/absolute_beginners.html
  
* Pandas
  * https://www.w3schools.com/python/pandas/
  * https://www.kaggle.com/learn/pandas
  * https://www.datacamp.com/tutorial/pandas
  * https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/


## Programming for Data Science

<center>
    <img src="https://raw.githubusercontent.com/jcrumpton/DSCI_6113_Data/main/DS_Club/Data%20Lifecycle.png">
</center>

<center>
    <img src="https://raw.githubusercontent.com/jcrumpton/DSCI_6113_Data/main/DS_Club/Process%20Data.png">
</center>

<center>
    <img src="https://raw.githubusercontent.com/jcrumpton/DSCI_6113_Data/main/DS_Club/Process%20Data%20Steps.png">
</center>

## Tools to Learn / Use

* Load and Clean Data  
  Use Pandas unless you have a reason not to (for example, data too large to fit in memory)
* Explore Data  
  Pandas, numpy, matplotlib
* Models  
  SciPy, scikit-learn  
  https://www.w3schools.com/python/python_ml_getting_started.asp  
  https://www.kaggle.com/learn/intro-to-machine-learning  
  Deep Learning: PyTorch, Keras, TensorFlow  
  https://www.datacamp.com/tutorial/pytorch-vs-tensorflow-vs-keras

## Questions / Comments

* Where do we go from here?