# Intro to Python Programming for Data Science
## Dr Austin R Brown
## School of Data Science and Analytics
### Kennesaw State University

In [2]:
# === COURSE REPO SETUP === #

# 1. ENTER your GitHub username (the one that owns your fork)
github_username = "mlwilson-ksu"

# 2. Name of the repo (don't change unless your fork name is different)
repo_name = "STAT-7220-Applied-Experimental-Design"

# 3. Build the full repo URL for cloning
repo_url = f"https://github.com/{github_username}/{repo_name}.git"

import os

# --- Detect if we're already in a repo ---
cwd = os.getcwd()
if cwd.endswith(repo_name):
    print(f"✅ Already inside repo folder: {cwd}")
else:
    # --- If the repo folder exists, check if it's nested ---
    if os.path.exists(repo_name):
        print(f"⚠️ Found existing folder '{repo_name}'. Skipping clone to avoid nesting.")
    else:
        print(f"📥 Cloning repo from {repo_url}...")
        os.system(f"git clone {repo_url}")

    # --- Change to repo directory ---
    if os.path.exists(repo_name):
        os.chdir(repo_name)
        print(f"📂 Changed directory to: {os.getcwd()}")
    else:
        print("❌ ERROR: Repo folder not found. Please check your GitHub username.")

# --- Check if this is the instructor's repo instead of student's fork ---
# This command needs to be run from within the repository directory
remote_url = os.popen("git config --get remote.origin.url").read().strip()

if "abrown9008" in remote_url:
    print("⚠️ WARNING: You are working in the instructor's repo, not your fork!")
    print("💡 Please fork the repo to your own account and update `github_username` above.")
else:
    print(f"🔗 Connected to fork at: {remote_url}")

# Set Today's Directory #

today_dir = "Python for Data Science"
os.chdir(today_dir)
print(f"📂 Changed directory to: {os.getcwd()}")

📥 Cloning repo from https://github.com/mlwilson-ksu/STAT-7220-Applied-Experimental-Design.git...
📂 Changed directory to: /content/STAT-7220-Applied-Experimental-Design
🔗 Connected to fork at: https://github.com/mlwilson-ksu/STAT-7220-Applied-Experimental-Design.git
📂 Changed directory to: /content/STAT-7220-Applied-Experimental-Design/Python for Data Science



- When we think about statistics and data science, very often our minds go straight to the analysis of data.

- However, being able to effectively answer questions with data requires a systematic, well-thought-out approach.
    - The analysis is only part of it!

- For us, the scientific method lays out a nice framework that we can use to guide our thinking.

### The Scientific Method

- Remember from high school science class that the scientific method generally involves:
    1. Making a hypothesis about some phenomenon
        - This includes defining our independent (features) and dependent (targets) variables
    2. Collecting data to test the hypothesis
    3. Analyzing the data
    4. Drawing conclusions from the data
    5. Refining the hypothesis and repeating the process

- Even if we aren't working in laboratory sciences, this systematic approach helps us make sure we're using the right data to answer the right question.

- I like to call working through steps 1 - 5 of the scientific method a **study**.

### Types of Studies

- Generally speaking, we can classify studies into two categories:
    1. **Observational Studies**
    2. **Experimental Studies**

- In an **observational study**, we are simply observing our independent and dependent variables. We don't have control over how observational units get assigned to specific values of either the independent or dependent variables.

#### Observational Study Example

- For example, suppose I wanted to know if the mean annual income differs between undergraduate data science majors and psychology majors.

- In this case, mean annual income serves as my quantitative dependent (or outcome) variable and major (data science or psychology) serves as my categorical independent (or predictor/explanatory) variable.

- I as the researcher in this case don't have control over whether students are data science or psychology majors -- I'm simply observing a phenomenon.
    - So this study would be classified as an *observational* study.

- Let's contrast this with an experimental study.

#### Experimental Study Example

- Let's say you work for a marketing department in a retail company. We have an email list of 10,000 customers.

- We want to test the impact of two different email subject lines (e.g., a generic subject line and a personalized subject line) on annual spending with our company.

- In this case, we could randomly assign our customers to either the generic subject line group or the personalized subject line group and follow their spending over the course of a year before making a comparison.

- Email group serves as our categorical independent variable and annual spending serves as our quantitative outcome variable.

- But notice that in this case we, the researchers, assigned the participants to their respective groups.
    - This is the key difference between observational and experimental research.
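
- As a rough sketch of what "randomly assign" could look like in code, the example below shuffles a balanced set of group labels across a made-up list of 10,000 customer IDs. The column names, the 50/50 split, and the seed value are all assumptions chosen for illustration, not part of the course materials:

```python
import numpy as np
import pandas as pd

# Hypothetical customer IDs standing in for the 10,000-person email list
customers = pd.DataFrame({"customer_id": np.arange(1, 10_001)})

# Seeded generator so the same assignment can be reproduced later
rng = np.random.default_rng(seed=7220)

# Build a balanced set of labels (5,000 of each), then shuffle their order
labels = np.repeat(["generic", "personalized"], len(customers) // 2)
customers["email_group"] = rng.permutation(labels)

# Each group should contain exactly half of the customers
print(customers["email_group"].value_counts())
```

- Because the labels are shuffled by the random number generator rather than chosen by a person, nothing about a customer can influence which subject line they receive.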

### Why Should I Care About Experimentation?

- While we may often associate experiments with laboratory science, experimental design is actually very helpful in a variety of fields including:
    1.  Engineering
    2.  Manufacturing/Quality Control
    3.  Marketing
    4.  Social Science
    5.  Educational Research
    6.  Much more!

- The roots of DOE (design of experiments) go back to the statistician R. A. Fisher and his work on agricultural field trials studying crop yields.

- Over the course of the semester, you will see how DOE is important to:

- **Informed Decision-Making**: It provides a structured approach to testing hypotheses, allowing people/organizations to make data-driven decisions based on reliable evidence.

- **Resource Optimization**: By identifying what works and what doesn't, people/organizations can allocate their resources more efficiently, avoiding wasted time and money on ineffective strategies.

- **Process Improvement**: Well-designed experiments can uncover insights that lead to innovative solutions and improvements, which may serve as a competitive advantage in a business setting.

### Definitions

- Before we go much further, it may be helpful to have some agreed-upon definitions (so we're speaking the same language!).

- Note, if there is ever a time when a term is used in these slides or elsewhere that doesn't seem well-defined, **PLEASE ASK FOR GUIDANCE!!**

- **Experiment (or Run)**: an action where the experimenter changes at least one of the variables being studied and then observes the effect of the action.
    - Randomizing our customers into the generic and personalized email groups and then observing their purchasing behavior was an experiment.


- **Experimental Unit**: the item under study upon which something is changed. This could be raw materials, human subjects, or just a point in time.
    - Our individual customers in the marketing example were our experimental units.

- **Independent Variable (Factor or Treatment Factor or Feature)**: We generally think of this as the $X$ variable that is being controlled at some level during any given experiment.
    - Email group from the prior example

- **Lurking Variable (Background Variable)**: a variable that the experimenter is unaware of or cannot control which could have an effect on the outcome of the experiment.
    - In the email example, annual income probably plays a role in annual spending. This isn't something the marketing department can control.

- **Dependent Variable (or Response or Outcome or Target)**: Usually denoted $Y$, this is the characteristic of the experimental unit that is measured after each experiment/run.

- **Effect**: The change in the response that is caused by a change in a factor/independent variable.

- These definitions will get us started, but there will be more new terms added as we progress through the semester!
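
- To make the **Effect** definition a little more concrete, here is a minimal sketch that estimates an effect as the difference between two group means. The six spending values are invented purely for illustration, and comparing means is just one simple way an effect might be summarized:

```python
import pandas as pd

# Made-up annual spending (in dollars) for six hypothetical customers
spending = pd.DataFrame({
    "email_group": ["generic", "generic", "generic",
                    "personalized", "personalized", "personalized"],
    "annual_spend": [100.0, 120.0, 110.0, 130.0, 150.0, 140.0],
})

# Mean of the response (Y) within each level of the factor (X)
group_means = spending.groupby("email_group")["annual_spend"].mean()

# Estimated effect: change in mean response between the two factor levels
estimated_effect = group_means["personalized"] - group_means["generic"]

print(group_means)
print(f"Estimated effect: {estimated_effect:.2f}")
```

- With these toy numbers, the group means are 110 and 140, so the estimated effect of the personalized subject line is 30 dollars.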

### Planning Experiments

- The key to successful experiments (and studies in general really) is a very clear articulation about what you're studying, why you're studying it, how you're studying it, and how you'll draw conclusions from the experiment(s).

- More specifically:
    - **(1)** Define the objective
    - **(2)** Decide what the outcome is (and how it will be measured)
    - **(3)** Determine the independent variables and possible lurking variables
    - **(4)** Choose the design (more on this as the semester progresses)
    - **(5)** Be clear on data collection processes/procedures
    - **(6)** Be clear on which analyses will be performed and how they are appropriate for the objective and design
    - **(7)** Draw conclusions

- As the semester progresses, we will systematically go through each of these steps in every lecture, example, and assignment.

## Why Learn Python?  
- You may be asking yourself, out of all of the possible programming languages which exist, why should I spend time learning Python?

- Great question!

- Python is a useful tool and worthwhile to learn for several reasons:
    1. It's free!
    2. Because it's open source, thousands of people have contributed packages and functions at a pace that proprietary software can't compete with
    3. It is a very flexible and robust general programming language, meaning there's a lot you can do with it in the data science space and beyond!
    4. It has become a de facto standard for data science work in industry

## So What is Python?

- Python is a command-line, object-oriented general programming language commonly used for data analysis, data science, and statistics.

- **Command-line** means that we have to give it commands in order for us to get it to do something. For example:

In [None]:
## What is the sum of 2 & 3? ##
2 + 3

5

- **Object-oriented** means, for our purposes, that we can save individual pieces of output under a name (an *object*) that we can use later. This is a super handy feature, especially when you have complicated scripts! For example:

In [None]:
## Save 2 + 3 as "a" ##
a = 2 + 3
print(a)
print(a*2)

5
10


## What Can Python do?

- What can Python do? Well, for the purpose of data analytics, I have yet to find a limit!

- In this class, we will be learning how to use Python as a tool in the data science workflow with specific attention placed on designing and evaluating experiments (more on that in the first week's lecture!)

- What is the data science workflow? Let's take a look!

![From R for Data Science 2nd Edition](https://github.com/mlwilson-ksu/STAT-7220-Applied-Experimental-Design/blob/main/Python%20for%20Data%20Science/Data%20Science%20Workflow.png?raw=1)

## Importing Data/Data Loading

- Since a major reason we use Python is for the analysis of data, we need to know how to import/load data from various sources and file formats into our Python programming environment.

- There are a variety of ways of importing data into our Python programming environment, which largely depend on the type of datafile that you are importing (e.g., Excel file, CSV file, text file, SAS dataset, SPSS dataset, etc.).

- While there are lots of different files which can be imported into our Python programming environment (Google and GenAI tools are excellent resources for finding starter code), we're going to focus on two main types: Excel and CSV.

- For example, let's try importing a CSV file using Python. This file is part of the famous Framingham Heart Study and is called `HEART.csv` and is located in the `Python for Data Science` subfolder that's part of our class GitHub repo.

- To read in this CSV file using Python, we will use the `read_csv` function, which is part of the famous `pandas` package.

### Defining Packages and Functions

- Okay, but before we get into reading in the `HEART` CSV file, what in the world are packages and functions?

- We can think of packages like toolboxes in a mechanic's shop. Each toolbox contains different tools used for specific purposes.

- To access a particular tool, we have to go to the right toolbox.
    - A toolbox is like a package
    - The tools within the toolbox are like functions within a package

- Thus, `read_csv` is a tool (function) within the `pandas` toolbox (package).

- A function can also be thought of like a mathematical function: we provide some input and some specific output is returned. Now, while our Python programming environment comes with some functions pre-loaded, almost all others in existence have to be installed from the web, including `pandas`.
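
- As a quick illustration of the difference between pre-loaded functions and functions that live inside a package, the sketch below uses only tools that ship with Python. The `math` package and the specific numbers are simply examples chosen here, not something from the course repo:

```python
# Pre-loaded (built-in) functions: available without opening any toolbox
n_items = len([2, 3, 5])      # input: a list, output: its length
rounded = round(3.14159, 2)   # input: a number, output: that number rounded

# A function that lives inside a package: open the toolbox first...
import math

# ...then reach for the tool with package_name.function_name
root = math.sqrt(16)

print(n_items, rounded, root)
```

- Here `len` and `round` work right away, while `sqrt` has to be reached through its toolbox as `math.sqrt`.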

- To install a Python package, we use a command-line tool called `pip`, whose name is a recursive acronym for "pip installs packages".
    - In brief, `pip` is a package management system used to install and manage software packages written in Python.

- Since we need `pandas` to load the `HEART.csv` file, we can install it by using:

In [None]:
## Install pandas using pip ##
## Remove the # to the left of %pip
## before executing the code ##

#%pip install pandas

- Now, we can load the `pandas` library into our current Python environment by using the `import` statement. Note, to access the functions within a package, we have to use the following code syntax: `package_name.function_name`.

- Typically, when we import a package, we give it a shorter alias to keep our code brief.
    - `pandas` is almost universally imported as `pd`

In [None]:
## Load pandas library ##
import pandas as pd

## Load HEART CSV ##
heart = pd.read_csv("HEART.csv")

- Awesome! Now that we've loaded the `heart` dataframe, how do we know that it loaded correctly?

- There are two general approaches I'd recommend. One is a simple overview of the first few rows of the dataframe, which we can obtain via the `.head` method.

- The second is by using the `.info` method. This is similar to `dplyr::glimpse` in R or `PROC CONTENTS` in SAS.
    - Let's check out how we'd use both techniques here!

In [None]:
## First, how many rows and columns
## are in the dataframe?
## Let's use the .shape attribute! ##

print(heart.shape)

(5209, 17)


- Nice! So we know we have 5209 rows (or observations) and 17 columns (or variables).
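
- To see where the two numbers in `.shape` come from, here is a minimal sketch that reads a tiny CSV from memory instead of from disk. The three rows of text are made up for illustration (they only borrow column names from `HEART.csv`), and `io.StringIO` is used here just so the example runs without any file:

```python
import io
import pandas as pd

# Made-up CSV text held in memory (stands in for a file on disk)
csv_text = """Sex,AgeAtStart,Height
Female,29,62.5
Male,42,66.0
Female,57,
"""

# read_csv accepts a file path or a file-like object such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))

# .shape is a (rows, columns) tuple, so it can be unpacked into two names
n_rows, n_cols = toy.shape
print(f"{n_rows} rows and {n_cols} columns")
```

- The header line becomes the column names, so three lines of data give 3 rows and 3 columns. The empty `Height` entry in the last line is read in as a missing value.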

- Now let's check out the techniques for inspecting the dataframe!

In [None]:
## .head method ##

print(heart.head(n=5))

  Status DeathCause  AgeCHDdiag     Sex  AgeAtStart  Height  Weight  \
0   Dead      Other         NaN  Female          29   62.50   140.0   
1   Dead     Cancer         NaN  Female          41   59.75   194.0   
2  Alive        NaN         NaN  Female          57   62.25   132.0   
3  Alive        NaN         NaN  Female          39   65.75   158.0   
4  Alive        NaN         NaN    Male          42   66.00   156.0   

   Diastolic  Systolic    MRW  Smoking  AgeAtDeath  Cholesterol Chol_Status  \
0         78       124  121.0      0.0        55.0          NaN         NaN   
1         92       144  183.0      0.0        57.0        181.0   Desirable   
2         90       170  114.0     10.0         NaN        250.0        High   
3         80       128  123.0      0.0         NaN        242.0        High   
4         76       110  116.0     20.0         NaN        281.0        High   

  BP_Status Weight_Status   Smoking_Status  
0    Normal    Overweight       Non-smoker  
1      H

- With `.head`, we can see that the variable `Sex`, for example, is a *categorical* variable, meaning that its values are qualities (e.g., `Male` or `Female`).

- On the other hand, `Height` seems to be measured with numbers, likely implying that it is a *quantitative* variable.
    - We can use `.info` and the `pyjanitor` package to confirm this!

In [None]:
## .info method ##
## .info() prints its summary directly and returns None,
## so wrapping it in print() is not needed ##

heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5209 entries, 0 to 5208
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Status          5209 non-null   object 
 1   DeathCause      1991 non-null   object 
 2   AgeCHDdiag      1449 non-null   float64
 3   Sex             5209 non-null   object 
 4   AgeAtStart      5209 non-null   int64  
 5   Height          5203 non-null   float64
 6   Weight          5203 non-null   float64
 7   Diastolic       5209 non-null   int64  
 8   Systolic        5209 non-null   int64  
 9   MRW             5203 non-null   float64
 10  Smoking         5173 non-null   float64
 11  AgeAtDeath      1991 non-null   float64
 12  Cholesterol     5057 non-null   float64
 13  Chol_Status     5057 non-null   object 
 14  BP_Status       5209 non-null   object 
 15  Weight_Status   5203 non-null   object 
 16  Smoking_Status  5173 non-null   object 
dtypes: float64(7), int64(3), object(7

- In this output, look again at the `Sex` variable. We see that it has 5209 non-null (non-missing) values, meaning that for every row in the dataframe, we have a valid, non-missing value!

- Then next to this information, we can see that its datatype is `object`. This is the dtype `pandas` typically uses for columns of text (strings), which is how nominal, categorical variables like `Sex` are usually stored when a file is first read in.

- For `Height`, we see that it has 6 missing values ($5209 - 5203 = 6$). We also see that `pandas` stores it as `float64`, a 64-bit floating-point (decimal) number, which is the usual storage type for a continuous, quantitative variable.
    - Notice that `AgeAtStart` is stored as `int64`, a 64-bit integer. This is the usual designation for an integer-valued, or discrete, quantitative variable.
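
- The missing-value arithmetic above (total rows minus non-null count) can also be done in code. The sketch below uses a small made-up dataframe, not the real `heart` data, to show two ways of counting missing values along with the storage type of each column:

```python
import numpy as np
import pandas as pd

# Made-up dataframe with one missing Height value
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "Female", "Male"],
    "AgeAtStart": [29, 42, 57, 36],
    "Height": [62.5, 66.0, np.nan, 64.25],
})

# Missing values per column: total rows minus the non-null count
n_missing = len(toy) - toy.count()
print(n_missing)

# .isna().sum() gives the same counts in one step
n_missing_direct = toy.isna().sum()
print(n_missing_direct)

# Storage type (dtype) of each column
print(toy.dtypes)
```

- Both approaches report one missing value for `Height` and none for the other two columns.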

## Working with Dataframes

- Let's say I wanted to find the average or mean of the `AgeAtStart` column from the `heart` dataframe. How would I go about doing that?

- First, I need to know how to refer to that single variable by itself.

- To do this, we make use of the square bracket notation. Specifically, we can call a single column within a dataframe by using the following syntax: `df['Variable_Name']`.

- You can think of the square bracket like a door to your home. The name of the dataframe is the house itself, the brackets are the door, and the variable name is the person we want to talk to inside of the house.
    - So the structure is `house['Person']`.

- If we run the following command, let's see what happens:

In [None]:
## house["Person"] ##

print(heart['AgeAtStart'])

0       29
1       41
2       57
3       39
4       42
        ..
5204    49
5205    42
5206    51
5207    36
5208    36
Name: AgeAtStart, Length: 5209, dtype: int64


- If we go back and look at the output of the `.head` method, we can see that the first few values in this column match the `AgeAtStart` values we saw there.

- Note, the first element in Python (and many other programming languages) is indexed as 0 rather than 1. There are some historical reasons for this, but it also ends up being convenient in some operations (e.g., range and length calculations).
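
- Here is a small sketch of what zero-based indexing means in practice. It uses the first five `AgeAtStart` values shown in the `.head` output and puts them into a standalone pandas Series. Building the Series by hand and using `.iloc` (position-based lookup) are choices made for this illustration:

```python
import pandas as pd

# First five AgeAtStart values from the .head output
ages = pd.Series([29, 41, 57, 39, 42], name="AgeAtStart")

# Position 0 is the first element
first = ages.iloc[0]

# With 5 elements, the last one sits at position 5 - 1 = 4
last = ages.iloc[len(ages) - 1]

# range(len(...)) produces exactly the valid positions: 0 through 4
positions = list(range(len(ages)))

print(first, last, positions)
```

- This is the convenience mentioned above: `range(len(ages))` lines up with the valid positions without any adjustment.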

- Okay so now that we've established how to directly refer to a variable within a dataframe, how do we calculate the mean? Here, we are going to use the popular `numpy` package to help us out!

In [None]:
## Install numpy. Remember
## to remove # from the left
## of the following line of
## code before executing ##

#%pip install numpy

## Load numpy ##

import numpy as np

## Calculate Mean AgeAtStart ##

print(round(np.mean(heart['AgeAtStart']), 2))

44.07


- So the mean age at the start of the study is 44.07 years!
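
- As a side note, a pandas column also carries its own `.mean` method, so `numpy` is one option rather than a requirement. The sketch below compares the two on five hand-typed ages (the first five rows of `AgeAtStart`, so the result will not match the full-data mean):

```python
import numpy as np
import pandas as pd

# First five AgeAtStart values, typed in by hand for illustration
ages = pd.Series([29, 41, 57, 39, 42], name="AgeAtStart")

# Option 1: numpy function applied to the column
mean_np = round(float(np.mean(ages)), 2)

# Option 2: the column's own .mean() method
mean_pd = round(float(ages.mean()), 2)

print(mean_np, mean_pd)
```

- Both give 41.6 here, since $(29 + 41 + 57 + 39 + 42) / 5 = 208 / 5 = 41.6$.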

## Tidying Data

### Selecting Columns

- Now, let's say I have a large dataframe with lots of columns of information, as you might see in your own careers.

- But, for whatever analysis I'm wanting to do, I don't need all of the columns, just a few.

- In such a case, it might be useful to subset the dataframe and select only the columns we need.

- How do we go about doing this? Like many things in Python, there are a few different ways to yield the same result, but I'm going to show you what I consider the most straightforward method, which uses the `house["Person"]` syntax we worked with previously.

- Let's say using the `heart` dataframe, I want to create a new dataframe which only contains the last four columns: `Chol_Status`, `BP_Status`, `Weight_Status`, and `Smoking_Status`.


In [None]:
## Create new dataframe using a subset of
## columns ##

heart1 = heart[['Chol_Status',
                'BP_Status',
                'Weight_Status',
                'Smoking_Status']]

## .info() prints its summary directly and returns None,
## so wrapping it in print() would add a stray "None" line ##
heart1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5209 entries, 0 to 5208
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Chol_Status     5057 non-null   object
 1   BP_Status       5209 non-null   object
 2   Weight_Status   5203 non-null   object
 3   Smoking_Status  5173 non-null   object
dtypes: object(4)
memory usage: 162.9+ KB


- We can see that the column selection worked because we have the same number of rows (5209) but now only those four columns we specified.
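
- Since there are a few ways to select columns, here is a hedged sketch comparing three of them on a small made-up dataframe. Note the double brackets in the first approach: the outer pair is the "door," and the inner pair is a Python list of the column names we want.

```python
import pandas as pd

# Made-up two-row dataframe borrowing a few HEART column names
toy = pd.DataFrame({
    "Sex": ["Female", "Male"],
    "AgeAtStart": [29, 42],
    "BP_Status": ["Normal", "High"],
    "Weight_Status": ["Overweight", "Normal"],
})

keep = ["BP_Status", "Weight_Status"]

by_brackets = toy[keep]                            # house[[list of people]]
by_loc = toy.loc[:, keep]                          # all rows, named columns
by_drop = toy.drop(columns=["Sex", "AgeAtStart"])  # remove what we don't need

# All three approaches produce the same dataframe
same_result = by_brackets.equals(by_loc) and by_brackets.equals(by_drop)
print(same_result)
```

- Which one reads best is a matter of taste. Dropping columns tends to be shorter when we want to keep most of a wide dataframe.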

### Filtering Rows

- We just learned how to subset columns. What if we wanted to subset by values in the rows?

- For example, let's say in the new `heart1` dataframe we just created, we want to create a new dataframe where we only have those participants whose `Weight_Status` is `"Overweight"`.

- Again, there are a few different approaches, but I would recommend using what we've learned so far with the `house['Person']` syntax.


In [None]:
## Create new dataframe from heart1
## where participants' Weight_Status == "Overweight" ##

heart2 = heart1[heart1['Weight_Status'] == "Overweight"]

print(heart2.shape)

(3550, 4)


- Here we can see that the number of columns is the same (4) but now the number of rows is smaller than the total number in the `heart1` dataframe.
    - How can we confirm that the filtering worked correctly?

- One strategy is by counting up the number of participants in the `heart1` dataframe whose `Weight_Status` was considered "Overweight".

- If that number matches the number of rows in `heart2`, then we can feel confident that the filtering worked the way we expected it to.
    - Let's try using the `.value_counts` method to check our work!


In [None]:
## Tabulate Number of Overweight Participants
## in heart1 dataframe ##

print(heart1['Weight_Status'].value_counts())

Weight_Status
Overweight     3550
Normal         1472
Underweight     181
Name: count, dtype: int64


- Since the number of Overweight participants in `heart1` (3550) is equal to the number of rows in `heart2`, we can feel confident our filtering worked the way we anticipated!
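
- It may help to see what is happening inside the brackets when we filter. The comparison produces a column of `True`/`False` values (often called a boolean mask), and only the `True` rows are kept. The sketch below uses a made-up five-row dataframe, including one missing value, to show the mask and a programmatic version of the row-count check:

```python
import numpy as np
import pandas as pd

# Made-up dataframe; one Weight_Status value is missing
toy = pd.DataFrame({
    "BP_Status": ["Normal", "High", "Normal", "High", "Normal"],
    "Weight_Status": ["Overweight", "Normal", "Overweight", np.nan, "Underweight"],
})

# The comparison returns one True/False value per row
mask = toy["Weight_Status"] == "Overweight"
print(mask)

# Passing the mask through the "door" keeps only the True rows
overweight = toy[mask]

# True counts as 1, so summing the mask counts the matching rows
n_matches = int(mask.sum())
print(n_matches, len(overweight))
```

- Notice that the row with the missing `Weight_Status` is treated as `False`, so it is dropped by the filter rather than causing an error.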

## What is "Tidy" Data?

- Once we have imported data, our next job is often to "tidy" it. Tidy data refers to a particular way of structuring a dataset, that is, how its information is arranged into rows and columns.

- A tidy dataframe has the following characteristics:
    1. Each variable is a column; each column is a variable.
    2. Each observation is a row; each row is an observation.
    3. Each value is a cell; each cell is a single value.

![From R for Data Science 2nd Edition](https://github.com/mlwilson-ksu/STAT-7220-Applied-Experimental-Design/blob/main/Python%20for%20Data%20Science/Tidy%20Data%20Structure.png?raw=1)

- Why should we care about having our data in tidy format? There are two key reasons:

- First, consistency. It's much easier to work with datasets if we know what format they're in.

- Second, this is generally the structure most Python functions want the data to be in to work correctly.
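
- As a sketch of what tidying can look like, the example below starts from a made-up "wide" table where one variable (visit) is spread across column names, and reshapes it with `pandas`' `melt` method so that each variable gets its own column. The participants, visits, and blood pressure values are invented for illustration:

```python
import pandas as pd

# Made-up "wide" table: the visit number is hidden in the column names
wide = pd.DataFrame({
    "participant": ["A", "B"],
    "visit1": [124, 144],
    "visit2": [120, 150],
})

# melt stacks the visit columns into (visit, systolic) pairs
tidy = wide.melt(
    id_vars="participant",   # column(s) that identify each unit
    var_name="visit",        # new column holding the old column names
    value_name="systolic",   # new column holding the measurements
)

print(tidy)
```

- The tidy version has four rows, one per participant per visit, and three columns: `participant`, `visit`, and `systolic`.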