## Dealing with data

Data is made up of variables, which we can classify into two basic groups:

- Numeric: numbers, either whole or fractions
     - Discrete: Only certain values possible within a given range (e.g. counts)
     - Continuous: Any value is possible within a given range (e.g. measurements)

- Categorical (factors): identifies a grouping of data
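
As a rough sketch of how these groups can show up in pandas, the snippet below builds a tiny table and prints the type pandas infers for each column. The data and the column names (`count`, `height_cm`, `site`) are made up for illustration and are not part of any dataset used in this lesson.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical data, invented only to illustrate the three kinds of variable
df = pd.DataFrame({
    "count": [3, 5, 2],                   # discrete numeric: whole-number counts
    "height_cm": [12.4, 15.1, 9.8],       # continuous numeric: measurements
    "site": ["north", "south", "north"],  # categorical: grouping labels
})

# Show the type pandas inferred for each column
print(df.dtypes)

# Check column by column whether pandas treats it as numeric
for name in df.columns:
    print(name, is_numeric_dtype(df[name]))
```

Integer and floating-point columns are both reported as numeric, while the text column is not; whether a numeric column is discrete or continuous is something we decide from how it was measured.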

Both numeric and categorical data are collected through experimental and observational studies.
Most data is recorded electronically, either as Excel spreadsheets (.xlsx)
or as comma-separated values files (.csv), in which the values on each line
are separated from each other by commas.

Example csv format:

| variable_1, | variable_2, | group |
|------------|------------|-------|
| 10,         | 15,         | 1     |
| 11,         | 13,         | 1     |
| 12,         | 11,         | 1     |
| 14,         | 11,         | 1     |
| 12,         | 11,         | 2     |
| 11,         | 11,         | 2     |
| 15,         | 13,         | 2     |
| 16,         | 12,         | 2     |

Notice that the first row contains a variable name for each column, and every other row
holds a set of measurements (numeric) or identifiers (factors).

- Each row represents a single observation of a particular set of variables.
- Each column represents all observations of a particular variable.
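
To make the row and column structure concrete, here is a minimal sketch that parses the example table above as raw CSV text. Reading from a string with `io.StringIO` is only a convenience so the example runs without a file; the variable names `csv_text` and `example` are assumptions for this sketch.

```python
import io
import pandas as pd

# The example table written as plain CSV text: one header line, then one line per observation
csv_text = """variable_1,variable_2,group
10,15,1
11,13,1
12,11,1
14,11,1
12,11,2
11,11,2
15,13,2
16,12,2
"""

# pandas reads from any file-like object, so wrap the string in StringIO
example = pd.read_csv(io.StringIO(csv_text))

print(example.shape)          # (number of rows, number of columns)
print(example.iloc[0])        # one row: a single observation of all variables
print(example["variable_1"])  # one column: all observations of one variable
```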

Important concepts
- The entire table together is a data frame
- Each column represents a single vector or array of data
- A column can hold numeric data or factors; a column made up of numbers (such as `group` above) may still be a factor
- We need to explicitly indicate which variables are factors in our code
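
One possible way to indicate a factor in pandas is to convert the column to the `category` dtype. The sketch below reuses the `variable_1` and `group` values from the example table; the variable name `df` is an assumption, and other approaches exist.

```python
import pandas as pd

# Values taken from the example table above
df = pd.DataFrame({
    "variable_1": [10, 11, 12, 14, 12, 11, 15, 16],
    "group": [1, 1, 1, 1, 2, 2, 2, 2],
})

# By default pandas treats the group codes as integers
print(df["group"].dtype)

# Explicitly mark the column as a factor (categorical variable)
df["group"] = df["group"].astype("category")
print(df["group"].dtype)
print(df["group"].cat.categories.tolist())

# Factors are typically used to split the data, e.g. a mean per group
group_means = df.groupby("group", observed=True)["variable_1"].mean()
print(group_means)
```

After the conversion, the codes 1 and 2 are treated as labels rather than quantities, so the group column is used to split the data instead of being averaged.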

## Exercise: Importing data


In Python, you can use various libraries to import data. For example, to import
the Iris dataset, you can use the following code with the seaborn library, which
returns the data as a pandas DataFrame:

In [3]:
import seaborn as sns

# Load the Iris dataset
iris = sns.load_dataset("iris")

# Display the first few rows of the dataset
print(iris.head())

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


.csv files (from the internet)

In [2]:
# To import data from a .csv file, you can use pandas as well. For example:

import pandas as pd

url = "https://data.cdc.gov/resource/9bhg-hcku.csv"
covid_data = pd.read_csv(url)
print(covid_data.head())

                data_as_of               start_date                 end_date  \
0  2023-09-13T00:00:00.000  2020-01-01T00:00:00.000  2023-09-09T00:00:00.000   
1  2023-09-13T00:00:00.000  2020-01-01T00:00:00.000  2023-09-09T00:00:00.000   
2  2023-09-13T00:00:00.000  2020-01-01T00:00:00.000  2023-09-09T00:00:00.000   
3  2023-09-13T00:00:00.000  2020-01-01T00:00:00.000  2023-09-09T00:00:00.000   
4  2023-09-13T00:00:00.000  2020-01-01T00:00:00.000  2023-09-09T00:00:00.000   

      group  year  month          state        sex     age_group  \
0  By Total   NaN    NaN  United States  All Sexes      All Ages   
1  By Total   NaN    NaN  United States  All Sexes  Under 1 year   
2  By Total   NaN    NaN  United States  All Sexes    0-17 years   
3  By Total   NaN    NaN  United States  All Sexes     1-4 years   
4  By Total   NaN    NaN  United States  All Sexes    5-14 years   

   covid_19_deaths  total_deaths  pneumonia_deaths  \
0        1144031.0      12183330         1155816.0   
1 

.xlsx files (from the internet)

In [10]:
# To import data from an .xlsx file, you can use pandas' read_excel function
# (pandas relies on the openpyxl library to read .xlsx files). For example:

import pandas as pd
import requests

# URL of the Excel file
url = "https://www.w3resource.com/python-exercises/pandas/excel/SaleData.xlsx"

# Download the Excel file and stop with an error if the request failed
response = requests.get(url)
response.raise_for_status()

# Save the Excel file locally
with open("SaleData.xlsx", "wb") as file:
    file.write(response.content)

# Load the Excel data into a Pandas DataFrame
df = pd.read_excel("SaleData.xlsx")

# Display the first few rows of the DataFrame
print(df.head())

   OrderDate   Region  Manager   SalesMan          Item  Units  Unit_price  \
0 2018-01-06     East   Martha  Alexander    Television   95.0      1198.0   
1 2018-01-23  Central  Hermann     Shelli  Home Theater   50.0       500.0   
2 2018-02-09  Central  Hermann       Luis    Television   36.0      1198.0   
3 2018-02-26  Central  Timothy      David    Cell Phone   27.0       225.0   
4 2018-03-15     West  Timothy    Stephen    Television   56.0      1198.0   

   Sale_amt  
0  113810.0  
1   25000.0  
2   43128.0  
3    6075.0  
4   67088.0  


If the files are on local storage

In [None]:
# If the data is in .csv format, you can use pandas to read it:

import pandas as pd

penguin_data = pd.read_csv("palmer_penguins.csv")
print(penguin_data.head())

# If the data is in .xlsx format, you can use pd.read_excel as shown earlier for web-based .xlsx files.

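You may come across other data file formats (e.g., .txt). pandas can read many of
them; for delimited text files, `pd.read_csv` accepts a `sep` argument that sets the
separator. For anything else, search the pandas documentation for a matching reader.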
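
As an illustration, the sketch below reads a tab-separated .txt file with `pd.read_csv` and `sep="\t"`. The file name `example.txt` and its contents are hypothetical; the file is written to a temporary folder first so that the example is self-contained.

```python
import os
import tempfile
import pandas as pd

# Create a small tab-separated file to read back
# (hypothetical file name and contents, for illustration only)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")
with open(path, "w") as file:
    file.write("variable_1\tvariable_2\tgroup\n")
    file.write("10\t15\t1\n")
    file.write("12\t11\t2\n")

# sep tells pandas which character separates the values on each line
txt_data = pd.read_csv(path, sep="\t")
print(txt_data)
```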

## Exercise: Exploring data

We can explore the data using built-in Python functions and pandas methods.

In [None]:
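# List the names defined in the current session (iris should be among them)
print(dir())

# Display the first and last few rows of the dataset
print(iris.head())
print(iris.tail())

# Determine if a particular variable is numeric or categorical:
# a numeric dtype (such as float64) indicates a numeric variable,
# a non-numeric dtype (such as object or category) a categorical one
print(iris['species'].dtype)
print(iris['sepal_length'].dtype)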
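
A few other pandas tools may be useful when exploring a data frame. The sketch below uses a small stand-in table built from the five Iris rows printed earlier, so it runs without downloading anything; the name `sample` is an assumption, and the same calls should work on the full `iris` DataFrame.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Stand-in for the iris DataFrame: the five rows shown in the output above
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
    "species": ["setosa"] * 5,
})

# Number of rows (observations) and columns (variables)
print(sample.shape)

# Summary statistics for the numeric columns
print(sample.describe())

# Label each variable as numeric or categorical based on its dtype
for name in sample.columns:
    kind = "numeric" if is_numeric_dtype(sample[name]) else "categorical"
    print(name, kind)

# Count how many observations fall in each category
print(sample["species"].value_counts())
```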