In [None]:
## Begin Setup
import pandas as pd
import matplotlib.pyplot as plt
## End Setup

# Data Basics

<center><img src="../images/stock/pexels-goumbik-577210.jpg" width="500"></center>

## Overview

Data comes in many flavors:

* Stock prices.
* Signals from sensors.
* Makes and models of cars.
* Movie reviews.
* And the list goes on.

No matter where data comes from or what kind it is, you're going to learn how to work with it.

In this notebook, we will cover:

* The main types of data.
* Where data comes from.
* The general steps (or the data pipeline) to:
    * Get the data.
    * Clean the data.
    * Find interesting patterns in it.

## Data Categories
<center><img src="../images/stock/pexels-19x14-8478408.jpg" width="500"></center>

Data can be divided into three main categories:

* __Unstructured__: This is often the rawest form of data.
* __Structured__: Data organized in a specific format, like tables.
* __Semi-structured__: Data that has some organizational properties but isn't fully structured like a table.

It's worth noting that in the data processing pipeline, the initial source of data is frequently unstructured.

### Unstructured Data

Unstructured data is information that doesn't fit neatly into a predefined organizational system or schema. It's one of the most common types of data we encounter and includes things like:

* Images
* Videos
* Audio recordings
* Natural language text (like emails, documents, and social media posts)

It's called unstructured because the information isn't arranged according to a fixed structure: there are no predefined fields, rows, or columns, so a program can't simply look up a value by name or position.
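
To see why that matters, here is a minimal sketch. The note and its numbers are made up for illustration. Nothing in the text labels the temperature readings, so we have to describe the pattern ourselves with a regular expression:

```python
import re

# Hypothetical free-form note; the wording and numbers are made up for illustration.
note = "The sensor logged 21.5 degrees at 09:00 and 23.1 degrees at 14:00."

# Nothing marks where the readings are, so we spell out the pattern:
# a decimal number followed by the word "degrees".
readings = [float(value) for value in re.findall(r"(\d+\.\d+) degrees", note)]
print(readings)
```

If the note were worded even slightly differently (say, "21.5°"), this pattern would miss the reading. That fragility is typical when working with unstructured data.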

#### Example: Unstructured PSU Data

The code cell below will open a file containing information about Portland State University:

In [None]:
## Example
text_path = "../data/psu.txt"

try:
    # The with statement closes the file automatically, even if reading fails
    with open(text_path, "r") as file:
        content = file.read()
    print(content)
except FileNotFoundError:
    print(f"Error: The file '{text_path}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")

### Structured Data

Structured data has a clear, predefined format that dictates how the data is organized. Think of it like information neatly arranged in a specific layout. This type of data is typically stored in:

* Databases
* CSV (Comma Separated Values) files
* Other organized repositories

There are two basic types of structured data:

* __Numerical__: This data expresses information in number form, allowing for mathematical operations (like addition, subtraction, averages, etc.).
* __Categorical__: This data can be sorted into categories based on similar characteristics.
    * For example, cars can be categorized by their make (Toyota, Mazda, Honda) or model (Camry, RX7, Civic).

It's interesting to note that some data might look numerical but is actually categorical. Think of zip codes or phone numbers. While they are made up of numbers, we usually don't perform mathematical calculations on them (like finding the average zip code). Instead, they serve as labels or identifiers.
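
Here is a small sketch of that idea in pandas. The listings, zip codes, and prices are made up for illustration. The zip codes are entered as text and then marked as categorical, while the prices stay numerical:

```python
import pandas as pd

# Made-up listings; the zip codes and prices are illustrative only.
listings = pd.DataFrame({
    "zip_code": ["97201", "97201", "02134"],  # entered as text so the leading zero survives
    "price": [350000, 410000, 525000],
})

# Mark the zip codes as labels rather than quantities.
listings["zip_code"] = listings["zip_code"].astype("category")

# Averaging a numerical column is meaningful...
average_price = listings["price"].mean()

# ...while a categorical column is better summarized by counting labels.
zip_counts = listings["zip_code"].value_counts()

print(average_price)
print(zip_counts)
```

Had the zip codes been stored as integers, `02134` would have lost its leading zero, which is one more reason to treat them as labels.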

#### Example: Structured PSU Data

The code cell below will open a structured file containing PSU data:

In [None]:
## Example
# Import Data
file = "../data/psu.csv"
df = pd.read_csv(file)

# Show the Data
df.head()

### Semi-Structured Data

Alright, let's move on to semi-structured data. This type of data doesn't adhere to a rigid formatting requirement like structured data. However, it does have some organizational properties that make it easier to work with than completely unstructured data.

Think of semi-structured data as having some hints of structure, like self-describing tags or other markers that help identify different pieces of information.

The most common semi-structured data formats you'll encounter are:

* __XML__ (Extensible Markup Language)
* __JSON__ (JavaScript Object Notation)
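
To make "self-describing" concrete, here is a minimal sketch with two made-up JSON records (the school names and fields are hypothetical). Each record labels its own fields, but the two records don't share exactly the same fields:

```python
import json
import pandas as pd

# Hypothetical records: each one labels its own fields, but the fields differ.
raw = """
[
  {"name": "School A", "location": {"city": "Portland", "state": "OR"}},
  {"name": "School B", "location": {"city": "Eugene"}, "mascot": "Owl"}
]
"""

records = json.loads(raw)

# json_normalize flattens nested keys into columns; missing fields become NaN.
schools_df = pd.json_normalize(records)
print(schools_df)
```

The keys give pandas enough structure to build a table, and the gaps show up as `NaN` wherever a record left a field out.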

#### Example: JSON File

The code cell below will load a JSON file containing data on several institutions.

In [None]:
## Example
file = "../data/college_data.json"
college_df = pd.read_json(file)

college_df.head()

### Time Series Data

Now, let's discuss time series data. This is essentially a collection of data points that are recorded or listed in a specific time order.

You'll often encounter time series data in areas like financial datasets, where each piece of information (like a stock price) is associated with a particular point in time.

__Note:__

Time series data can be either structured or semi-structured. 

Imagine you're tracking posts related to a specific hashtag over time. Each post has a timestamp. However, the content of each post (text, included media, user mentions, hashtags) varies greatly. While the timestamp provides the time series element, the post content itself is semi-structured.
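
A rough sketch of that hashtag scenario might look like the following. The posts, field names, and timestamps are all made up for illustration. Every post carries a timestamp, but the other fields vary from post to post:

```python
import pandas as pd

# Made-up posts: every post has a timestamp, but the remaining fields vary.
posts = [
    {"timestamp": "2024-01-01 09:15", "text": "Good morning!", "mentions": ["@a"]},
    {"timestamp": "2024-01-01 17:40", "text": "Sunset photo", "media": "photo.jpg"},
    {"timestamp": "2024-01-02 08:05", "text": "Back at it"},
]

posts_df = pd.DataFrame(posts)

# Parse the timestamps and use them as the index to get a time series.
posts_df["timestamp"] = pd.to_datetime(posts_df["timestamp"])
posts_df = posts_df.set_index("timestamp")

# Count posts per calendar day.
posts_per_day = posts_df.resample("D").size()
print(posts_per_day)
```

The timestamp index is what lets us ask time-based questions (posts per day), even though the rest of each post is only loosely organized.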



#### Example: Open Power System Data Time Series

The following code cell examines time series data from Open Power System Data, covering energy-related information for Germany from 2006 to 2017.

In essence, this dataset provides a daily snapshot of energy-related information, showing how much electricity was used and how much was generated from wind and solar sources:

In [None]:
## Example
file = "../data/opsd_germany_daily.csv"
open_power_df = pd.read_csv(file, 
                            index_col=0,
                            parse_dates=True)

# Partial-string indexing: select every row from February 2015
open_power_df.loc['2015-02']

## Sources of Data

<center><img src="../images/stock/pexels-ps-photography-14694-67184.jpg" width="500"></center>

Now, let's touch on where we typically get our data. Data originates from a huge variety of places – think text documents, videos, games, and sensors constantly collecting information.

However, for what we'll be doing in this class, the most common data sources you'll encounter are usually one of these:

* __APIs (Application Programming Interfaces)__: These allow different software systems to communicate and share data.
* __Web pages__: Information publicly available on the internet.
* __Online databases__: Organized collections of data accessible over a network.
* __Files__: Data stored in various formats like CSV, Excel, JSON, etc.

There are many other fascinating ways to get data, but for our purposes, these will be the main ones we'll focus on. We just don't have the time to explore every single possibility!
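
As a quick illustration, here is a sketch that loads the same small table from two of these kinds of sources. Everything in it is a stand-in: a `StringIO` object plays the role of a CSV file, an in-memory SQLite database plays the role of an online database, and the `inventory` table and its values are made up.

```python
import io
import sqlite3
import pandas as pd

# Source 1: a file. A StringIO object stands in for a CSV file on disk.
csv_text = "item,quantity\napples,3\npears,5\n"
from_file = pd.read_csv(io.StringIO(csv_text))

# Source 2: a database. An in-memory SQLite database stands in for an online one.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE inventory (item TEXT, quantity INTEGER)")
connection.executemany(
    "INSERT INTO inventory VALUES (?, ?)",
    [("apples", 3), ("pears", 5)],
)
from_database = pd.read_sql_query("SELECT * FROM inventory", connection)
connection.close()

print(from_file)
print(from_database)
```

Whatever the source, the goal is the same: get the data into a DataFrame so the rest of the work looks identical.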

## The Data Pipeline

<center><img src="../images/stock/pexels-spacex-586019.jpg" width="500"></center>

The Data Processing Pipeline outlines the typical journey of data, and it usually involves these key steps:

* __Acquisition__: Getting the data in the first place.
* __Cleaning__: Detecting, correcting, or removing flawed or unnecessary data.
* __Transformation__: Changing the format or structure of the data for analysis.
* __Analysis__: Interpreting the data and drawing meaningful conclusions.
* __Storage__: Keeping the results for future use.

It's important to remember that this isn't always a rigid sequence, and sometimes steps might be skipped or blended together.
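
Here is one way the five steps could look end to end on a tiny, made-up dataset. This is only a sketch: the names and scores are invented, and in-memory text stands in for a real source and a real output file.

```python
import io
import pandas as pd

# Acquisition: a small made-up CSV stands in for a real source.
raw_csv = "name,score\nAda,90\nGrace,\nLinus,70\n"
scores_df = pd.read_csv(io.StringIO(raw_csv))

# Cleaning: drop rows with a missing score.
scores_df = scores_df.dropna(subset=["score"])

# Transformation: add a column derived from an existing one.
scores_df = scores_df.assign(score_fraction=scores_df["score"] / 100)

# Analysis: compute a summary statistic.
mean_score = scores_df["score"].mean()
print(f"Mean score: {mean_score}")

# Storage: write the result out (to an in-memory buffer here instead of a file).
buffer = io.StringIO()
scores_df.to_csv(buffer, index=False)
print(buffer.getvalue())
```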

Let's break down each step.

### Acquisition

This is the initial step where we obtain the data. 

Pulling Tesla's stock data from the Yahoo Finance API, as we did in class, is a good example of data acquisition.

The code cell below shows another example of data acquisition: it uses the `BeautifulSoup` library to scrape data from the [Portland State University Wikipedia page](https://en.wikipedia.org/wiki/Portland_State_University).

In [None]:
## Example
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/Portland_State_University"

try:
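    response = requests.get(url, timeout=10)  # Give up if the server takes too long
    response.raise_for_status()  # Raise an error for 4xx/5xx responses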
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the first <h1> tag, which usually contains the main title
    title_tag = soup.find('h1')
    if title_tag:
        print(title_tag.text)
        # Find the first non-empty paragraph
        first_paragraph = None
        for p in soup.find_all('p'):
            if p.text.strip():  # Check if paragraph has content
                first_paragraph = p
                break
        if first_paragraph:
            print(f"\nFirst Paragraph:\n{first_paragraph.text[:200]}...")  # Limit to 200 chars
        else:
            print("\nCould not find a non-empty first paragraph.")
    else:
        print("Could not find the title on the page.")

except requests.exceptions.RequestException as e:
    print(f"Error: {e}")

### Cleaning

Once we have the data, we often need to clean it. This involves identifying and fixing (or removing) any data that is incorrect, incomplete, or irrelevant.

While the data we get from Yahoo Finance is often quite clean and ready to use in a DataFrame, other sources might require significant cleaning. 

For instance, data from BIKETOWN, Portland's Nike-sponsored bike-share system, might have inaccuracies that need to be addressed.

Data cleaning can also involve filtering out specific columns or rows if they aren't needed for our analysis, helping us focus on the most relevant information.
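
As a small sketch of that kind of filtering, consider the made-up bike trips below. The column names and values are hypothetical. We keep only the columns we care about, then keep only the rows that make sense:

```python
import pandas as pd

# Made-up bike trips; column names and values are illustrative only.
trips = pd.DataFrame({
    "trip_id": [1, 2, 3, 4],
    "duration_min": [12, -5, 30, 18],
    "bike_color": ["orange", "orange", "orange", "orange"],
})

# Keep only the columns needed for the analysis.
trips = trips[["trip_id", "duration_min"]]

# Keep only the rows that make sense: a trip cannot have a negative duration.
valid_trips = trips[trips["duration_min"] > 0]
print(valid_trips)
```

The expression `trips["duration_min"] > 0` produces a column of `True`/`False` values, and indexing with it keeps just the rows marked `True`.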

#### Cleaning Example

In pandas, `df.dropna()` removes missing values (represented as `NaN`). On a DataFrame, it drops every row that contains at least one `NaN` by default and returns a new DataFrame, leaving the original unchanged.

That `NaN` value in the College DataFrame was bothering me. How about we drop that row in the cell below:

In [None]:
## Example
cleaned_college_df = college_df.dropna()
cleaned_college_df

Alternatively, `df.fillna()` replaces missing values (`NaN`) with a value you specify, so no rows are lost.

In [None]:
## Example
cleaned_college_df = college_df.fillna(0)
cleaned_college_df

### Transformation

After cleaning, we might need to transform the data into a more suitable format for analysis. This could involve:

* Converting text-based data into numerical representations.
* Creating new columns based on calculations or combinations of existing columns.
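
Both kinds of transformation can be sketched on a tiny, made-up table of cars (the mileage values are invented for illustration):

```python
import pandas as pd

# Made-up data; the values are illustrative only.
cars = pd.DataFrame({
    "make": ["Toyota", "Mazda", "Honda", "Toyota"],
    "mpg_text": ["32", "24", "35", "30"],
})

# Numbers stored as text: convert them so arithmetic works.
cars["mpg"] = pd.to_numeric(cars["mpg_text"])

# Categories stored as text: map each distinct label to an integer code.
cars["make_code"] = cars["make"].astype("category").cat.codes

# A new column computed from an existing one.
cars["above_average_mpg"] = cars["mpg"] > cars["mpg"].mean()
print(cars)
```

Keep in mind that the integer codes are just labels in numeric form. Their order (alphabetical here) doesn't mean one make is "greater" than another.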

#### Example

The code cell below transforms the College DataFrame by adding a new column that estimates the per-term tuition cost by dividing in-state tuition by three (that is, assuming three terms per year).

In [None]:
## Example
per_term = round(cleaned_college_df["in-state tuition"] / 3)
cleaned_college_df["in-state tuition per term"] = per_term

new_order = ["name",
             "in-state tuition",
             "in-state tuition per term",
             "undergrad enrollment",
             "acceptance rate"
            ]

cleaned_college_df = cleaned_college_df[new_order]
cleaned_college_df

### Analysis

This is where the magic happens! We interpret the processed data to uncover insights and draw conclusions that weren't obvious in the raw data. This often involves:

* Visualizing data using plots and graphs to identify patterns.
* Calculating statistical summaries (like mean, median, standard deviation) to understand the data's characteristics.
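
For the statistical side, a minimal sketch on a handful of made-up readings might look like this:

```python
import pandas as pd

# Made-up daily readings; the numbers are illustrative only.
readings = pd.Series([10, 12, 11, 50, 13], name="reading")

print(f"Mean:   {readings.mean()}")
print(f"Median: {readings.median()}")
print(f"Std:    {readings.std():.2f}")

# describe() bundles the most common summaries into one table.
print(readings.describe())
```

Notice that the single large value (50) pulls the mean well above the median. Comparing the two is a quick way to spot outliers before you plot anything.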

#### Example: In-State Tuition

The code cell below uses `matplotlib` to generate a bar chart visualizing in-state tuition, using data from the College DataFrame.

In [None]:
## Example
# Create a bar chart of In-State Tuition for each university
plt.figure(figsize=(8, 5)) 
plt.bar(cleaned_college_df['name'], cleaned_college_df['in-state tuition'])

# Add labels and title
plt.xlabel('University')
plt.ylabel('In-State Tuition (USD)')
plt.title('In-State Tuition for Oregon Universities')
plt.xticks(rotation=15)  

# Display the plot
plt.show()

### Storage

Finally, we need a way to store the results of our analysis so we can access and use them later. This often involves saving the processed data or our findings to files in various formats.

#### Pandas: Exporting Files

Pandas DataFrames can be exported to various file formats using these methods:

* __CSV (`.csv`)__: Use `df.to_csv()` for comma-separated values, a widely compatible format.
* __Excel (`.xlsx`)__: Use `df.to_excel()` for Microsoft Excel files (this requires an Excel engine such as `openpyxl` to be installed).
* __TSV (`.tsv`)__: Use `df.to_csv()` with `sep='\t'` for tab-separated values, useful when data contains commas.
* __JSON (`.json`)__: Use `df.to_json()` for JavaScript Object Notation, a format common in web applications.
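
Before writing real files, it can help to see what an export actually produces. The sketch below uses made-up data and an in-memory buffer in place of a file on disk: it writes tab-separated text, then reads it back with the same separator.

```python
import io
import pandas as pd

# Made-up data; note the comma inside the first name.
original = pd.DataFrame({
    "name": ["Portland, OR", "Salem"],
    "visits": [3, 5],
})

# Write tab-separated text to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, sep="\t", index=False)
print(buffer.getvalue())

# Read the text back with the same separator.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")
print(restored)
```

Reading a file back with the same settings you used to write it is a simple check that nothing was lost along the way.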

In [None]:
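## Example
from pathlib import Path

# Create the export folder if it doesn't exist yet; otherwise the writes below fail
Path('../data/exports').mkdir(parents=True, exist_ok=True)
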
# 1. Export to CSV
csv_filename = '../data/exports/college_data.csv'
# index=False prevents writing the DataFrame index to the file
cleaned_college_df.to_csv(csv_filename, index=False)  
print(f"DataFrame exported to: {csv_filename}")

# 2. Export to Excel
excel_filename = '../data/exports/college_data.xlsx'
cleaned_college_df.to_excel(excel_filename, index=False)
print(f"DataFrame exported to: {excel_filename}")

# 3. Export to TSV (Tab-separated values)
tsv_filename = '../data/exports/college_data.tsv'
cleaned_college_df.to_csv(tsv_filename, sep='\t', index=False)
print(f"DataFrame exported to: {tsv_filename}")

# 4. Export to JSON
json_filename = '../data/exports/college_data.json'
# orient='records' creates a list of dictionaries
cleaned_college_df.to_json(json_filename, orient='records')  
print(f"DataFrame exported to: {json_filename}")