## Chapter 2 Data handling

- Chungnam National University, Machine Learning-Based Sequence Analysis
- Copy the shared file into the `2024-cnu-lecture` folder, then open it

#### Connect Google Drive (must be re-run whenever the session restarts)

In [None]:
!pwd

/content


In [None]:
from google.colab import drive
import os

drive.mount('/content/drive')
# Use an absolute path so the cell can be re-run after the working directory has changed
os.chdir('/content/drive/MyDrive/2024-cnu-lecture')

Mounted at /content/drive


In [None]:
!pwd

/content/drive/MyDrive/2024-cnu-lecture


#### User:

I want to learn "2. Data Handling and Analysis". Can you give me a brief explanation with examples?


#### ChatGPT

Sure! Data handling and analysis are essential skills in Python, especially for tasks like manipulating datasets, performing statistical analyses, and creating visualizations. We'll focus on the key libraries used for these purposes: NumPy, Pandas, and Matplotlib/Seaborn.

### 1. NumPy

NumPy is a powerful library for numerical operations and handling arrays.

#### Basics of NumPy

**Installation:**
```bash
!pip install numpy
```


In [None]:
import numpy as np


# Creating a 1D array
arr1 = np.array([1, 2, 3, 4, 5])

# Creating a 2D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

print(arr1)
print(arr2)


# Element-wise operations
arr3 = arr1 * 2
print(arr3)

# Mathematical functions
print(np.sqrt(arr1))


[1 2 3 4 5]
[[1 2 3]
 [4 5 6]]
[ 2  4  6  8 10]
[1.         1.41421356 1.73205081 2.         2.23606798]


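# NumPy data structure: ndarray

* A Python library for working with matrices and multidimensional arrays
* All elements of an array must have the same data type
* Often more than 20 times faster than a Python list for numerical operations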
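
The two properties above can be checked directly. The following is a minimal sketch, not part of the original lecture code: it shows how mixed input is converted to a single dtype and gives a rough timing comparison. The array size and repeat count are arbitrary choices, and the measured speedup depends on the machine and the operation.

```python
import timeit

import numpy as np

# Mixed int and float input is upcast to one common type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)         # float64

# Numbers mixed with text are all stored as strings
as_text = np.array([1, 'a'])
print(as_text.dtype.kind)  # 'U' (Unicode string)

# Rough timing: double one million values with a list and with an array
values = list(range(1_000_000))
array = np.arange(1_000_000)
list_time = timeit.timeit(lambda: [v * 2 for v in values], number=5)
array_time = timeit.timeit(lambda: array * 2, number=5)
print(f"list: {list_time:.3f}s, ndarray: {array_time:.3f}s")
```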

![from https://www.oreilly.com/library/view/elegant-scipy/9781491922927/ch01.html](https://raw.githubusercontent.com/greendaygh/bioengml/master/numpy.PNG)  

In [None]:
import numpy as np

In [None]:
arr = [1, 2, 3]
print(arr)
print(type(arr))

a = np.array([1,2,3])
print(a)
print(a.dtype)
print(a.shape)
print(type(a))

[1, 2, 3]
<class 'list'>
[1 2 3]
int64
(3,)
<class 'numpy.ndarray'>


In [None]:
arr2 = np.array([[1,2,3], [4,5,6]])
print(arr2)
print(type(arr2))
print(arr2.shape)
print(arr2.dtype)

[[1 2 3]
 [4 5 6]]
<class 'numpy.ndarray'>
(2, 3)
int64


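* NumPy data types
 * Signed integers: int(8, 16, 32, 64)
 * Unsigned integers: uint(8, 16, 32, 64)
 * Floating point: float(16, 32, 64, 128); float128 is platform-dependent
 * Complex: complex(64, 128, 256); complex256 is platform-dependent
 * Boolean: bool
 * Byte string: bytes_ (named string_ before NumPy 2.0)
 * Python object: object
 * Unicode string: str_ (named unicode_ before NumPy 2.0)

* Array creation with np.zeros(), np.ones(), np.arange()
* Matrix (array) operations are supported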

In [None]:
a = np.arange(1, 10).reshape(3, 3)  # integers 1 to 9; the stop value 10 is excluded
print(a)
a = np.ones((3,4), dtype=np.int16)
b = np.ones((3,4), dtype=np.int16)
print(a)
print(b)
print(a+b)
print(a-b)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
[[1 1 1 1]
 [1 1 1 1]
 [1 1 1 1]]
[[1 1 1 1]
 [1 1 1 1]
 [1 1 1 1]]
[[2 2 2 2]
 [2 2 2 2]
 [2 2 2 2]]
[[0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]
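
The cell above covers `np.arange()` and `np.ones()` with element-wise addition and subtraction. As a small supplementary sketch (the shapes are chosen only for illustration), `np.zeros()` and the difference between element-wise multiplication (`*`) and the matrix product (`@`) look like this:

```python
import numpy as np

zeros = np.zeros((2, 3))              # 2x3 array of 0.0 (float64 by default)
a = np.arange(6).reshape(2, 3)        # [[0 1 2], [3 4 5]]
b = np.ones((3, 2), dtype=np.int64)   # 3x2 array of ones

elementwise = a * a       # same-shape arrays are multiplied element by element
matrix_product = a @ b    # (2x3) @ (3x2) gives a 2x2 matrix

print(zeros)
print(elementwise)        # [[ 0  1  4], [ 9 16 25]]
print(matrix_product)     # [[ 3  3], [12 12]]
```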


* NumPy functions
 * np.sqrt()
 * np.log()
 * np.square()
 * np.ceil()
 * np.floor()
 * np.isnan()
 * np.sum()
 * np.mean()
 * np.std()
 * np.min()
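
A short sketch of several of these functions on small, made-up arrays. Two details are worth noticing: `np.std()` computes the population standard deviation by default (`ddof=0`), and a single `NaN` makes `np.sum()` return `NaN`. `np.nansum()` is not in the list above; it is shown here only as one way to skip missing values.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(np.sum(x), np.mean(x), np.min(x))   # 40.0 5.0 2.0
print(np.std(x))                          # 2.0 (population std, ddof=0)
print(np.std(x, ddof=1))                  # sample std, slightly larger

half = np.array([1.2, 2.5, 3.8])
print(np.ceil(half))     # [2. 3. 4.]
print(np.floor(half))    # [1. 2. 3.]

measured = np.array([1.0, np.nan, 3.0])
missing = np.isnan(measured)
print(missing)               # [False  True False]
print(np.sum(measured))      # nan
print(np.nansum(measured))   # 4.0
```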

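# Pandas data structures (Series, DataFrame)

- In pandas, a Series holds one-dimensional data and a DataFrame holds two-dimensional data
- DataFrames are the primary data structure in pandas, representing tabular data with rows and columns.
- By convention, columns are variables and rows are samples
- Behaves like a combination of a list and a dictionary
- Can hold numeric, string, categorical, and other kinds of data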
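
To illustrate the last point, here is a minimal sketch of one DataFrame whose columns hold different kinds of data. The gene names, values, and column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical example values: one row per sample, one column per variable
samples = pd.DataFrame({
    'gene': ['geneA', 'geneB', 'geneC'],          # text
    'expression': [0.1, 1.4, 0.6],                # numeric
    'strand': pd.Categorical(['+', '-', '+']),    # categorical
})

print(samples.dtypes)                      # one dtype per column
print(samples['strand'].cat.categories)    # the distinct category labels
```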

In [None]:
from pandas import Series, DataFrame

In [None]:
genes = Series([0.1, 0.2, 1.4, 0.6, 1.1])
print(genes)

0    0.1
1    0.2
2    1.4
3    0.6
4    1.1
dtype: float64


In [None]:
genes = Series([0.1, 0.2, 1.4, 0.6, 1.1], index=['A', 'B', 'C', 'D', 'E'])
print(genes)

A    0.1
B    0.2
C    1.4
D    0.6
E    1.1
dtype: float64


#### Create from a dictionary

In [None]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Height': [5.5, 6.0, 5.8]
}

df = pd.DataFrame(data)
print(df)


      Name  Age  Height
0    Alice   25     5.5
1      Bob   30     6.0
2  Charlie   35     5.8


#### Create from a list of dictionaries

In [None]:
data = [
    {'Name': 'Alice', 'Age': 25, 'Height': 5.5},
    {'Name': 'Bob', 'Age': 30, 'Height': 6.0},
    {'Name': 'Charlie', 'Age': 35, 'Height': 5.8}
]

df = pd.DataFrame(data)
print(df)


      Name  Age  Height
0    Alice   25     5.5
1      Bob   30     6.0
2  Charlie   35     5.8


* Automatic index alignment and element-wise arithmetic

In [None]:
genes1 = Series([0.1, 0.2, 1.4, 0.6, 1.1], index=['A', 'B', 'C', 'D', 'E'])
genes2 = Series([0.1, 0.2, 1.4, 0.6, 1.1], index=['B', 'C', 'D', 'E', 'A'])
genes1 + genes2

A    1.2
B    0.3
C    1.6
D    2.0
E    1.7
dtype: float64
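
In the cell above both Series share the same labels, so every label finds a partner. As a supplementary sketch with made-up values, this is what alignment does when the labels only partly overlap: unmatched labels produce `NaN`, unless `Series.add()` is given a `fill_value`.

```python
from pandas import Series

s1 = Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
s2 = Series([10.0, 20.0], index=['B', 'D'])

plain_sum = s1 + s2                       # only 'B' exists in both Series
filled_sum = s1.add(s2, fill_value=0)     # treat a missing partner as 0

print(plain_sum)     # A NaN, B 12.0, C NaN, D NaN
print(filled_sum)    # A 1.0, B 12.0, C 3.0, D 20.0
```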

In [None]:
print(genes2.sort_values())
print(genes2.sort_index())

B    0.1
C    0.2
E    0.6
A    1.1
D    1.4
dtype: float64
A    1.1
B    0.1
C    0.2
D    1.4
E    0.6
dtype: float64


* A DataFrame can be created from a dictionary, written with '{' and '}'
* A DataFrame is a collection of Series

In [None]:
genes = {'A': [0.5, 0.1, 0.3],
         'B': [0.8, 0.9, 0.4]}
print(genes)
genes_df = DataFrame(genes)
print(genes_df)
print(genes_df['A'])
print(type(genes_df['A']))

{'A': [0.5, 0.1, 0.3], 'B': [0.8, 0.9, 0.4]}
     A    B
0  0.5  0.8
1  0.1  0.9
2  0.3  0.4
0    0.5
1    0.1
2    0.3
Name: A, dtype: float64
<class 'pandas.core.series.Series'>


In [None]:
genes = {'A': [0.5, 0.1, 0.3],
         'B': [0.8, 0.9, 0.4]}
genes_df = DataFrame(genes, columns=['B', 'A'], index=['day1', 'day2', 'day3'])
print(genes_df)

        B    A
day1  0.8  0.5
day2  0.9  0.1
day3  0.4  0.3


In [None]:
print(genes_df['A'])
print(genes_df.loc['day1'])
print(genes_df.index)
print(list(genes_df.columns))

day1    0.5
day2    0.1
day3    0.3
Name: A, dtype: float64
B    0.8
A    0.5
Name: day1, dtype: float64
Index(['day1', 'day2', 'day3'], dtype='object')
['B', 'A']
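
Beyond selecting a whole column or row, single cells and filtered rows can be selected as well. The sketch below rebuilds the same `genes_df` so that it runs on its own; the threshold 0.2 is an arbitrary value for illustration.

```python
from pandas import DataFrame

genes_df = DataFrame({'A': [0.5, 0.1, 0.3], 'B': [0.8, 0.9, 0.4]},
                     columns=['B', 'A'], index=['day1', 'day2', 'day3'])

cell = genes_df.loc['day2', 'A']              # by labels: row 'day2', column 'A'
first = genes_df.iloc[0, 0]                   # by position: first row, first column ('B')
high_a = genes_df.loc[genes_df['A'] > 0.2]    # boolean filter on column 'A'

print(cell)      # 0.1
print(first)     # 0.8
print(high_a)    # rows day1 and day3
```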


#### ChatGPT

### 2. Pandas

Pandas is a powerful library for data manipulation and analysis. It provides data structures like DataFrame, which is similar to tables in databases or spreadsheets.

#### Basics of Pandas

**Installation:**
```bash
pip install pandas
```

**Importing Pandas:**
```python
import pandas as pd
```

**Creating DataFrames:**
```python
# From a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Height': [5.5, 6.0, 5.8]
}

df = pd.DataFrame(data)
print(df)
```

**Reading and Writing Data:**
```python
# Writing to a CSV file (index=False leaves out the row index)
df.to_csv('output.csv', index=False)

# Reading the same CSV file back
df = pd.read_csv('output.csv')
print(df)
```
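
The effect of `index=False` can be seen without touching the disk. This is a small sketch using an in-memory buffer (`io.StringIO`); the `people` table is made up for illustration. When the index is written, reading the file back yields an extra column named `Unnamed: 0`.

```python
import io

import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

with_index = people.to_csv()                # returns CSV text when no path is given
without_index = people.to_csv(index=False)

cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with)       # ['Unnamed: 0', 'Name', 'Age']
print(cols_without)    # ['Name', 'Age']
```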

**DataFrame Operations:**
```python
# Selecting columns
ages = df['Age']
print(ages)

# Filtering rows
adults = df[df['Age'] > 18]
print(adults)

# Adding a new column
df['Weight'] = [65, 70, 75]
print(df)

# Grouping and aggregating (each name is unique here, so every group holds one row)
grouped = df.groupby('Name').mean(numeric_only=True)
print(grouped)
```
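
Grouping becomes more informative when a group contains several rows. The following sketch uses a hypothetical `Team` column and an extra made-up person, neither of which is in the original table, to show a per-group mean and a multi-function aggregation.

```python
import pandas as pd

# Hypothetical data: 'Team' and the row for Dana are invented for this example
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Team': ['X', 'X', 'Y', 'Y'],
    'Age': [25, 30, 35, 45],
    'Height': [5.5, 6.0, 5.8, 5.6],
})

# Mean of the numeric columns within each team
team_means = people.groupby('Team')[['Age', 'Height']].mean()

# Several aggregations of one column at once
team_summary = people.groupby('Team')['Age'].agg(['min', 'max', 'count'])

print(team_means)
print(team_summary)
```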

### 3. Matplotlib/Seaborn

Matplotlib is a plotting library for creating static, animated, and interactive visualizations. Seaborn is built on top of Matplotlib and provides a high-level interface for drawing attractive statistical graphics.

#### Basics of Matplotlib

**Installation:**
```bash
pip install matplotlib seaborn
```

**Importing Matplotlib and Seaborn:**
```python
import matplotlib.pyplot as plt
import seaborn as sns
```

**Creating Plots:**
```python
# Line plot
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot')
plt.show()

# Scatter plot with Seaborn
sns.scatterplot(x='Age', y='Height', data=df)
plt.title('Age vs Height')
plt.show()
```
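
The plots above use the `plt.*` state-based functions. Matplotlib also offers an object-oriented style built on `plt.subplots()`, which can be easier to follow once a figure has several panels. A minimal sketch with the same numbers as the line plot above:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# One figure with two side-by-side axes
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

axes[0].plot(x, y)
axes[0].set_title('Line Plot')

axes[1].scatter(x, y)
axes[1].set_title('Scatter Plot')

for ax in axes:
    ax.set_xlabel('X-axis')
    ax.set_ylabel('Y-axis')

fig.tight_layout()
# In a notebook the figure is drawn when the cell finishes;
# in a plain script, call plt.show() here.
```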

### Example: Analyzing and Visualizing DNA Sequence Data

Let's combine these libraries in a practical example. We'll read a CSV file containing DNA sequences and their properties, perform some analysis, and visualize the results.

**Example Data (`dna_data.csv`):**
```csv
Sequence,Length,GC_Content
ATGCGTACGTAGCTAGCTAG,20,50.0
GCGTACGTAGCTAGCTAGCA,20,55.0
TGCATGCATGCTAGCTAGCA,20,50.0
```
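
The `Length` and `GC_Content` columns can also be derived from the `Sequence` column itself. The helper `gc_content` below is a hypothetical function written for this sketch, not part of pandas or of the original answer; it assumes non-empty sequences made of A, C, G, and T.

```python
import pandas as pd


def gc_content(sequence: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    gc = sequence.count('G') + sequence.count('C')
    return 100.0 * gc / len(sequence)


sequences = pd.DataFrame({'Sequence': [
    'ATGCGTACGTAGCTAGCTAG',
    'GCGTACGTAGCTAGCTAGCA',
    'TGCATGCATGCTAGCTAGCA',
]})

# Length in bases and GC percentage, computed per row
sequences['Length'] = sequences['Sequence'].str.len()
sequences['GC_Content'] = sequences['Sequence'].apply(gc_content)

print(sequences)
```

Computing the values from the sequences is a simple way to check a hand-written table before analysing it.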

**Python Script:**
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read the CSV file
df = pd.read_csv('dna_data.csv')

# Calculate basic statistics
mean_gc = df['GC_Content'].mean()
std_gc = df['GC_Content'].std()

print(f"Mean GC Content: {mean_gc:.2f}%")
print(f"Standard Deviation of GC Content: {std_gc:.2f}%")

# Visualize the data
plt.figure(figsize=(10, 5))

# Histogram of GC content
plt.subplot(1, 2, 1)
sns.histplot(df['GC_Content'], kde=True)
plt.title('GC Content Distribution')
plt.xlabel('GC Content (%)')

# Boxplot of sequence lengths
plt.subplot(1, 2, 2)
sns.boxplot(x=df['Length'])
plt.title('Sequence Length Distribution')
plt.xlabel('Length (bp)')

plt.tight_layout()
plt.show()
```

### Explanation:
- **Reading Data**: We read the DNA sequence data from a CSV file into a Pandas DataFrame.
- **Calculating Statistics**: We calculate the mean and standard deviation of the GC content using Pandas.
- **Visualizing Data**: We create a histogram of the GC content and a boxplot of the sequence lengths using Matplotlib and Seaborn.

This script demonstrates the integration of data handling, analysis, and visualization in Python. By learning and practicing these concepts, you'll be well-equipped to handle and analyze biological data.
