<div style="text-align:center;">
    <div style="font-size: 24pt; font-weight:bold;">
        Introduction to Data Science (CIS 3813)<br>
        Fall 2025
    </div>
    <div style="font-size: 18pt;">
        Week 1: Introduction to Data Science
    </div>
</div>

**Date:** 25 August 2025  
**Time:** 6:00–9:00 PM  
**Instructor:** Dr. Patrick T. Marsh  

Welcome! This notebook introduces the data science lifecycle and course tooling, and includes a quick hands-on session with Python, Pandas, and Matplotlib.

### **Learning goals**
- Explain the data science lifecycle and core roles.
- Configure a working Python data science environment.
- Run a short EDA (exploratory data analysis) in Pandas and produce a simple chart.

## **Agenda (Week 1)**
1. [Welcome and Course Orientation](#welcome-and-course-orientation)
2. [The Data Science Lifecycle and Roles](#the-data-science-lifecycle-and-roles)
3. [Tools of the Trade](#tools-of-the-trade)
4. [Hands-on Lab: Configuring Our Environment](#hands-on-lab-configuring-our-environment)
5. [Mini EDA Demonstration](#mini-eda-demonstration)
6. [HW 1 Assignment: Real-World Examples](#hw-1-real-world-examples)


## **Welcome and Course Orientation**
Below are some highlights from the course syllabus. You can find the full details, including the course schedule, on Canvas.
- **Meetings:** Mondays 6–9 PM
- **Office Hours:** Mondays & Wednesdays 4:30–5:50 PM (and by appointment / virtual)
- **Graded Components:** 
    - Homework (25%)
    - Labs/Participation (10%)
    - Quizzes (10%)
    - Midterm Project (15%)
    - Final Project (25%) 
    - Final Exam (15%)
- **AI Use:** Allowed with attribution and full understanding; include your prompts when applicable.

## **The Data Science Lifecycle and Roles**

Below is a commonly used workflow. We'll revisit it throughout the semester. 

<div style="text-align:center;">

**Problem $\longleftrightarrow$ Data Acquisition $\longleftrightarrow$ Cleaning/Preparation $\longleftrightarrow$ Exploration/Visualization $\longleftrightarrow$ Modeling/Inference $\longleftrightarrow$ Evaluation $\longleftrightarrow$ Communication/Deployment**

</div>

**Notice that the arrows point in both directions.** In practice the arrows are a bit more muddled. You can end up jumping from any point of the process to any other point of the process &mdash; multiple times &mdash; as you discover issues or new questions. 

**Other Key Ideas:**
- *Context-First:* Clarify objectives and stakeholders before touching data!
- *Reproducible:* Use notebooks, version control, and clear documentation!

**Industry Roles & Responsibilities** (non-exhaustive)
- *Data Scientist:* modeling, experimentation, insights
- *Data Analyst:* descriptive analytics, dashboards, reporting
- *Machine Learning Engineer:* "productionizing" models
- *Data Engineer:* pipelines, Extract-Transform-Load/Extract-Load-Transform, warehousing
- *Analytics Engineer:* modeling data in the warehouse (e.g., databases)

## **Tools of the Trade**
Here is a non-exhaustive list of tools we will leverage this semester.
- Python 3
    - IPython
    - Jupyter Notebooks
    - NumPy
    - SciPy
    - Pandas
    - Matplotlib
    - scikit-learn (plus sub-packages/modules)
    - statsmodels
- [Conda/Miniconda](https://www.anaconda.com/docs/getting-started/miniconda/main) to manage our environments and packages (more on this below).
- [Visual Studio Code](https://code.visualstudio.com/) as an editor/IDE (or you can use the Jupyter Notebook)
- [Git & GitHub](https://github.com/) for version control and collaboration

## **Hands-on Lab: Configuring Our Environment**

#### **Setup Checklist:**
1. Install Visual Studio Code (link above)
2. Install Miniconda (link above)
3. Configure conda to use the community-run conda-forge channel (which is free) rather than Anaconda's "defaults" channel (which can require a paid license, depending on your organization). 

    <div style="padding-left: 50px;">

    ```sh
    $ conda config --add channels conda-forge
    $ conda search python
    ```

    </div>

    If every line of the search output lists "conda-forge" as its channel, you were successful. If any other channel appears ("pkgs/main" is the most likely), we need to find out where conda is getting its channel configuration. The following command lists the configuration files to check for other channels:

    <div style="padding-left: 50px;">

    ```sh
    $ conda config --show-sources
    ```

    </div>


4. Create a conda environment for this class: 

    <div style="padding-left: 50px;">
        
    ```sh
    % conda create -n cis3803 python=3
    ```

    </div>

5. Activate the new environment:

    <div style="padding-left: 50px;">
        
    ```sh
    % conda activate cis3803
    ```
    
    </div>

6. Install the Python packages listed previously:

    <div style="padding-left: 50px;">
        
    ```sh
    % conda install ipython jupyter numpy scipy pandas matplotlib scikit-learn statsmodels
    ```
    
    </div>


7. Configure Git:

    <div style="padding-left: 50px;">

    ```sh
    % git config --global user.name "Your Name"
    % git config --global user.email "you@domain.com"
    ```
    </div>

8. Verify your Python environment. To do this, open an interactive Python console and execute the following:

    <div style="padding-left: 50px;">

    ```python
    import sys, numpy, pandas, matplotlib
    print("Python:", sys.version)
    print("NumPy:", numpy.__version__)
    print("Pandas:", pandas.__version__)
    print("Matplotlib:", matplotlib.__version__)
    ```
    </div>
    
    Your output should be something akin to the following:

    <div style="padding-left: 50px;">

    ```sh
    Python: 3.13.5 | packaged by conda-forge | (main, Jun 16 2025, 08:24:05) [Clang 18.1.8 ]
    NumPy: 2.3.2
    Pandas: 2.3.2
    Matplotlib: 3.10.5
    ```

    </div>
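
The channel check in step 3 can also be scripted. The following is a minimal sketch rather than an official conda recipe. The helper names `read_conda_channels` and `unexpected_channels` are hypothetical, and the sketch assumes that `conda config --show channels --json` prints a JSON object containing a `channels` list.

```python
import json
import shutil
import subprocess


def read_conda_channels():
    """Return the configured channel list, or None if conda is not on PATH.

    Assumption: `conda config --show channels --json` emits JSON shaped like
    {"channels": ["conda-forge", ...]}.
    """
    if shutil.which("conda") is None:
        return None
    result = subprocess.run(
        ["conda", "config", "--show", "channels", "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout).get("channels", [])


def unexpected_channels(channels, allowed=("conda-forge",)):
    """Return the channels that are not in the allowed set, in order."""
    return [name for name in channels if name not in allowed]


# Example with a hand-written list (not read from a real machine)
print(unexpected_channels(["conda-forge", "defaults"]))
```

An empty list from `unexpected_channels` suggests that only conda-forge is configured. Any other entry points to a channel that should be removed from one of the files reported by `conda config --show-sources`.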

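The version check in step 8 covers only three packages. The sketch below is one way to extend it to most of the tools installed in step 6, using only the standard library. The helper `installed_versions` and the `REQUIRED` list are hypothetical names chosen for illustration. The `jupyter` package is left out on the assumption that a conda metapackage may not be visible to `importlib.metadata`.

```python
from importlib import metadata

# Distribution names as installed in step 6; adjust the list as needed.
REQUIRED = ["ipython", "numpy", "scipy", "pandas",
            "matplotlib", "scikit-learn", "statsmodels"]


def installed_versions(names):
    """Map each distribution name to its version string, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

Run it from inside the activated environment. A package reported as missing is most likely absent from that environment, which usually means step 5 or step 6 needs to be repeated.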

## **Mini EDA Demonstration**

We'll create a small synthetic dataset (pretending to be movie rating information) to practice Pandas operations and a simple visualization. This is designed to get you using our tools. Don't worry if you aren't familiar with the Python packages and commands we'll use. We will cover those in future classes.

**Goals**
- Create a Pandas DataFrame
- Compute summary statistics
- Group, sort, and plot

In [None]:
import pandas as pd
import numpy as np

np.random.seed(42)
n = 200
genres = ["Action", "Comedy", "Drama", "Sci-Fi", "Horror"]
years = np.random.choice(range(2005, 2025), size=n)
data = pd.DataFrame({
    "title": [f"Movie {i:03d}" for i in range(n)],
    "genre": np.random.choice(genres, size=n, p=[0.22, 0.25, 0.28, 0.17, 0.08]),
    "year": years,
    "rating": np.clip(np.random.normal(loc=5.0, scale=2.0, size=n), 1, 10),
    "box_office_millions": np.round(np.random.lognormal(mean=3.2, sigma=0.7, size=n) / 10, 2)
})
data.head()

In [None]:
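# Summary statistics for every column (numeric and categorical)
desc = data.describe(include='all')
print(desc)
# Distinct box-office values, as a quick sanity check on the generated range
data['box_office_millions'].unique()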

In [None]:
# Top genres by average rating
top_genres = (
    data.groupby('genre', as_index=False)
        .agg(avg_rating=('rating','mean'), count=('title','count'))
        .sort_values(['avg_rating','count'], ascending=[False, False])
)
top_genres
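
The cell above relies on Pandas "named aggregation": each keyword passed to `agg` becomes an output column, and its value is a `(source column, function)` pair. The sketch below repeats the same pattern on a hypothetical three-row frame that is small enough to check by hand. The toy titles and ratings are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, small enough to verify by hand
toy = pd.DataFrame({
    "title": ["A1", "A2", "B1"],
    "genre": ["Action", "Action", "Drama"],
    "rating": [4.0, 6.0, 8.0],
})

summary = (
    toy.groupby("genre", as_index=False)                       # one row per genre
       .agg(avg_rating=("rating", "mean"),                     # mean of rating
            count=("title", "count"))                          # number of titles
       .sort_values(["avg_rating", "count"], ascending=[False, False])
)
print(summary)
```

Drama (one movie, average 8.0) sorts ahead of Action (two movies, average 5.0) because `avg_rating` is the first sort key and the sort is descending.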

In [None]:
import matplotlib.pyplot as plt

plt.figure()
plt.bar(top_genres['genre'], top_genres['avg_rating'])
plt.title('Average Rating by Genre')
plt.xlabel('Genre')
plt.ylabel('Average Rating')
plt.xticks(rotation=30)
plt.show()

### Try It: Quick Challenges
1. Compute the correlation between `rating` and `box_office_millions`.
2. Create a new column `is_blockbuster` for movies with `box_office_millions > 50`.
3. Plot the distribution (histogram) of `rating`.
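
If you are unsure where to start, the sketch below shows the kind of calls each challenge needs. It uses a hypothetical four-row frame rather than the demo data. The values, the `cutoff`, and the `above_cutoff` column name are arbitrary choices for illustration, so treat it as a hint and not as the solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the demo, values are made up
toy = pd.DataFrame({
    "rating": [4.0, 6.5, 8.0, 5.5],
    "box_office_millions": [1.2, 3.4, 9.8, 2.1],
})

# 1. Pearson correlation between two numeric columns
corr = toy["rating"].corr(toy["box_office_millions"])

# 2. Boolean flag column from a comparison; the cutoff here is arbitrary
cutoff = 5.0
toy["above_cutoff"] = toy["box_office_millions"] > cutoff

# 3. Histogram counts without drawing; plt.hist(toy["rating"], bins=[1, 4, 7, 10])
#    would draw bars for the same bins
counts, edges = np.histogram(toy["rating"], bins=[1, 4, 7, 10])

print(round(corr, 3), toy["above_cutoff"].tolist(), counts.tolist())
```

Before settling on a cutoff for your own flag column, it may help to inspect the column's range (for example with `describe()`) so you know how many rows the comparison will select.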


## **HW 1: Real-World Examples**
**Due:** The start of our second class

**Task**
- Find **2–3 real-world examples** of data science (news article, industry case, open-source project, or personal experience).
- For each example, write 1–2 short paragraphs:
  - What question/problem was addressed?
  - What data was used and how was it collected?
  - What methods/visualizations were applied?
  - What decision or impact followed?
- Add a citation or link to the real-world example.

**Submission**
- Submit as a short PDF or Jupyter Notebook in Canvas. If you used AI assistance, include your prompt(s) and a brief note of how you used it.