# Course 7: Data Analysis with Python

## Module 1: Importing Data Sets

To perform data analysis in Python effectively, several key libraries are essential. These libraries can be categorized into three main groups: Scientific Computing Libraries, Data Visualization Libraries, and Algorithmic Libraries.

### 1. Scientific Computing Libraries
- **Pandas:** Provides data structures and tools for efficient data manipulation and analysis. Its primary structure, the DataFrame, allows for easy indexing and access to structured data.
- **NumPy:** Uses arrays for inputs and outputs, which enables fast array processing, and extends to matrix operations with minimal code changes.
- **SciPy:** Offers functions for advanced mathematical problems, such as integration, optimization, and linear algebra.
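
As a minimal sketch of how the first two libraries fit together, the example below applies element-wise arithmetic to a NumPy array and then wraps the same values in a Pandas DataFrame with labeled columns. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: arithmetic applies to every element of the array at once
prices = np.array([13495.0, 16500.0, 13950.0])  # hypothetical values
discounted = prices * 0.9

# Pandas: a DataFrame stores labeled columns that can be selected by name
df = pd.DataFrame({"make": ["alfa-romero", "alfa-romero", "audi"], "price": prices})
print(df["price"].mean())
print(df[df["make"] == "audi"])
```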

### 2. Data Visualization Libraries
- **Matplotlib:** The most well-known library for creating graphs and plots. It allows for highly customizable visual representations of data.
- **Seaborn:** Built on Matplotlib, this library makes it easy to generate a variety of plots, including heat maps, time series, and violin plots.

### 3. Algorithmic Libraries
- **Scikit-learn:** Contains tools for statistical modeling, such as regression, classification, and clustering. It is built on NumPy, SciPy, and Matplotlib.
- **Statsmodels:** Enables users to explore data, estimate statistical models, and perform statistical tests.

These libraries collectively offer a comprehensive toolkit for data manipulation, analysis, visualization, and the development of machine learning models in Python.



# Reading and Manipulating Data in Python with Pandas

## Data Acquisition with Pandas [Reading Data Using Pandas]

### 1. Understanding Data Format and Path
- **Format:** The way data is encoded, such as CSV, JSON, XLSX, HDF, etc.
- **Path:** Location where the data is stored, either locally or online.
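
To illustrate how the format determines which reader to call, the sketch below parses the same two records once as CSV and once as JSON. The in-memory strings stand in for a file path or URL, and the column names and values are hypothetical.

```python
import io
import pandas as pd

# The same records encoded in two formats (made-up sample data)
csv_text = "make,price\naudi,13950\nbmw,16430\n"
json_text = '[{"make": "audi", "price": 13950}, {"make": "bmw", "price": 16430}]'

# The format decides the reader; io.StringIO stands in for a path or URL
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

Pandas also provides `read_excel` and `read_hdf` for the XLSX and HDF formats; both rely on optional dependencies being installed.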

### Steps to Read Data Using Pandas

### 2. Reading CSV Files
- **Example Dataset:** We have a dataset of used cars stored online, where each row represents one data point with various properties separated by commas, indicating a CSV format.

### 3. Code to Read CSV Data
- **Import the Pandas library:**
```python
import pandas as pd
```

- **Define a variable with the file path:**
```python
file_path = 'http://example.com/used_cars.csv'
```

- **Use the `read_csv` method to read the data into a Pandas DataFrame:**
```python
df = pd.read_csv(file_path)
```

### 4. Handling Data Without Headers [set header = None]
- If the dataset has no column headers, set the `header` parameter to `None` in the `read_csv` method.
```python
df = pd.read_csv(file_path, header=None)
```
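
The effect of `header=None` is easiest to see side by side. This sketch reads the same header-less text twice; the sample rows are made up, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Made-up CSV content with no header row
raw = "audi,gas,13950\nbmw,gas,16430\n"

# Default: the first data row is consumed as column names
df_default = pd.read_csv(io.StringIO(raw))

# header=None: every row is kept as data and columns are numbered 0, 1, 2
df_no_header = pd.read_csv(io.StringIO(raw), header=None)

print(len(df_default), list(df_default.columns))
print(len(df_no_header), list(df_no_header.columns))
```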

### 5. Previewing Data
- **Use `df.head(n)` to display the first `n` rows:**
```python
print(df.head(5))  # First 5 rows
```

- **Use `df.tail(n)` to display the last `n` rows:**
```python
print(df.tail(5))  # Last 5 rows
```
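
A small sketch of both previews on a made-up ten-row DataFrame (the `row_id` column is hypothetical). It also shows that `n` can be left out, in which case five rows are returned.

```python
import pandas as pd

# A small made-up DataFrame with ten rows
df = pd.DataFrame({"row_id": range(10)})

first_three = df.head(3)   # rows 0, 1, 2
last_two = df.tail(2)      # rows 8, 9
default_head = df.head()   # n defaults to 5 when omitted

print(first_three)
print(last_two)
```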

### 6. Assigning Column Names [to add column header names]
- If column names are in a separate file or list, assign them to the DataFrame.
```python
headers = ['column1', 'column2', 'column3']  # Example column names
df.columns = headers
```
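
The sketch below compares two ways of attaching names, assuming a header-less file with exactly three columns: assigning to `df.columns` after reading, or passing the list through the `names` parameter of `read_csv`. The data and column names are hypothetical, and the list length has to match the number of columns.

```python
import io
import pandas as pd

raw = "audi,gas,13950\nbmw,gas,16430\n"    # made-up data, no header row
headers = ["make", "fuel_type", "price"]   # hypothetical column names

# Option 1: read first, then assign the list to df.columns
df = pd.read_csv(io.StringIO(raw), header=None)
df.columns = headers

# Option 2: pass the names while reading
df_named = pd.read_csv(io.StringIO(raw), header=None, names=headers)

print(df)
print(df_named)
```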

### 7. Exporting Data to CSV
- Use the `to_csv` method to save the DataFrame to a new CSV file.
```python
df.to_csv('path/to/automobile.csv', index=False)
```
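
As a rough check of what `index=False` does, this sketch writes a made-up DataFrame to a temporary directory, inspects the first line of the file, and reads the file back. The temporary path is an assumption made so that the example cleans up after itself.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})  # made-up data

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "automobile.csv")

    # index=False leaves the row labels (0, 1, ...) out of the file
    df.to_csv(out_path, index=False)

    with open(out_path) as f:
        header_line = f.readline().strip()

    # Reading the file back should reproduce the original columns
    df_back = pd.read_csv(out_path)

print(header_line)
print(df_back)
```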

# Lesson Summary

Congratulations! You have completed this lesson. At this point in the course, you know:

- **Dataset Structure:**
  - Each line in a dataset is a row, and commas separate the values.
  - To understand the data, you must analyze the attributes for each column of data.

- **Python Libraries:**
  - Python libraries are collections of functions and methods that provide functionality without writing code from scratch. The ones covered here fall into three categories: scientific computing, data visualization, and algorithmic (machine learning) libraries.
  - Many data science libraries are interconnected; for instance, Scikit-learn is built on top of NumPy, SciPy, and Matplotlib.

- **Reading Data with Pandas:**
  - The data format and the file path are two key factors for reading data with Pandas.
  - The `read_csv` method in Pandas can read files in CSV format into a Pandas DataFrame.

- **Data Types in Pandas:**
  - Pandas has its own data types, such as object, float, int, and datetime.
  - Use the `dtypes` attribute to check each column’s data type; misclassified data types might need manual correction.
  - Knowing the correct data types helps apply appropriate Python functions to specific columns.

- **Statistical Summary with `describe()`:**
  - The `describe()` method provides the count, mean, standard deviation, min, max, and the 25th, 50th, and 75th percentiles for numerical columns.
  - You can also use `include='all'` as an argument to get summaries for object-type columns.
  - The statistical summary helps identify potential issues like outliers needing further attention.

- **Overview with `info()`:**
  - The `info()` method gives a concise summary of the DataFrame: the index range, each column's name, non-null count, and data type, and the memory usage.
  - In the output of `describe(include='all')`, some statistics appear as "NaN" because they cannot be calculated for that column's data type, for example the mean of an object column.

- **Database Connections:**
  - Python can connect to databases through specialized code, often written in Jupyter notebooks.
  - SQL Application Programming Interfaces (APIs) and Python DB APIs (most often used) facilitate the interaction between Python and the DBMS.
  - SQL APIs connect to DBMS with one or more API calls, build SQL statements as a text string, and use API calls to send SQL statements to the DBMS and retrieve results and statuses.
  - DB-API, Python's standard for interacting with relational databases, uses connection objects to establish and manage database connections and cursor objects to run queries and scroll through the results.
  - Connection object methods include `cursor()`, `commit()`, `rollback()`, and `close()`.
  - You can import the database module, call its `connect()` function to open a connection, and then create a cursor object to run queries and fetch results.
  - Remember to close the database connection to free up resources.
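
The DB-API flow described above can be sketched with `sqlite3`, the DB-API module in Python's standard library. The in-memory database, the `cars` table, and its rows are assumptions made for illustration; other database modules follow the same connect, cursor, execute, fetch, close pattern but take different connection arguments.

```python
import sqlite3

# Open a connection; ":memory:" creates a temporary in-memory database
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Build SQL statements as text and send them through the cursor
cursor.execute("CREATE TABLE cars (make TEXT, price INTEGER)")
cursor.executemany(
    "INSERT INTO cars VALUES (?, ?)",
    [("audi", 13950), ("bmw", 16430)],
)
conn.commit()  # make the inserts permanent

# Run a query and fetch the results
cursor.execute("SELECT make, price FROM cars ORDER BY price")
rows = cursor.fetchall()
print(rows)

# Release resources when done
cursor.close()
conn.close()
```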

### Quiz Module 1:

#### Question 1:
Which of the following commands would you use to retrieve a concise summary of a dataset loaded as a pandas data frame `df`?

```python
df.info()
```
This command provides a concise summary of the DataFrame, including the number of non-null entries, column data types, and memory usage.
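
As a quick illustration on made-up data with one missing price, the sketch below captures the text that `info()` would normally print. The `buf` argument is used here only so the summary can be held in a variable.

```python
import io
import pandas as pd

# Made-up data with one missing price
df = pd.DataFrame({"make": ["audi", "bmw", "audi"], "price": [13950.0, None, 17450.0]})

# info() prints to stdout by default; buf= captures the text instead
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

The printed summary lists each column with its non-null count and data type, so the missing price shows up as a lower count for `price`.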

#### Question 2:
Which description best describes the Pandas library?

Offers data structures and tools for effective data manipulation and analysis. It provides fast access to structured data.

Pandas is known for its powerful data structures (like DataFrames) and tools for data manipulation and analysis.

#### Question 3:
What task does the following code perform?

```python
path = r'C:\Windows\…\automobile.csv'  # raw string keeps the backslashes literal
df.to_csv(path)
```

Exports your Pandas data frame to a new CSV file in the location specified by the variable path.

The `to_csv` method is used to export a DataFrame to a CSV file at the specified path.

#### Question 4:
How would you use the `describe()` method with a data frame `df` to get a statistical summary of only the columns with numerical values in the data frame?

```python
df.describe()
```

By default, `df.describe()` returns the statistical summary of only the numerical columns in the DataFrame.
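
A minimal sketch of this default behavior, using a made-up DataFrame with one text column and one numeric column:

```python
import pandas as pd

df = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "bmw"],          # text column
    "price": [10000.0, 20000.0, 30000.0, 40000.0],   # numeric column
})

# Only the numeric column appears in the default summary
stats = df.describe()
print(stats)
```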

#### Question 5:
What Python library is primarily used for machine learning?

- scikit-learn

#### Question 6:
We have the list `headers_list=['A','B','C']`. We also have the data frame `df` that contains three columns. What syntax should you use to replace the headers of the data frame `df` with values in the list `headers_list`?

- `df.columns = headers_list`

#### Question 7:
What task does the following command perform? `df = pandas.read_csv("A.csv")`

- Loads the data from a CSV file called "A.csv" into a data frame `df`

#### Question 8:
Consider the segment of the following data frame:

Table where the column with a make header includes values "audi" and "alfa-romero"

What is the type of attribute “make”?

- object
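
To see how column types are reported and corrected, the sketch below builds a made-up DataFrame in which `price` was stored as text, inspects `dtypes`, and converts the column with `astype`. Text columns have traditionally been reported as `object`; depending on the installed pandas version, a dedicated string type may be shown instead.

```python
import pandas as pd

# Made-up data: "price" was read as text, so it is not numeric yet
df = pd.DataFrame({"make": ["audi", "alfa-romero"], "price": ["13950", "16500"]})
print(df.dtypes)

# Manual correction: convert the text column to floating-point numbers
df["price"] = df["price"].astype("float")
print(df.dtypes)
print(df["price"].mean())
```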

#### Question 9:
How do you generate descriptive statistics for all the columns for the data frame `df`?

- `df.describe(include="all")`
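
A short sketch on made-up data shows what `include="all"` adds: text columns get `unique`, `top`, and `freq`, while statistics that do not apply to a column's type are filled with NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "make": ["audi", "bmw", "audi"],         # text column
    "price": [13950.0, 16430.0, 17450.0],    # numeric column
})

full_stats = df.describe(include="all")
print(full_stats)

# The mean cannot be computed for a text column, so it is NaN
print(pd.isna(full_stats.loc["mean", "make"]))
# The most frequent value of the text column
print(full_stats.loc["top", "make"])
```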