In [None]:
...
In **pandas**, a `DataFrame` (conventionally named `df`) has many commonly used methods and attributes for manipulating, analyzing, and cleaning data. Below is a list of **the most important and frequently used ones**, with short examples:

---

### 1. Exploring the data

* `df.head(n)` → Show the **first n rows** (default 5).
* `df.tail(n)` → Show the **last n rows** (default 5).
* `df.shape` → Returns `(n_rows, n_columns)`.
* `df.info()` → Overview: column dtypes, non-null counts, memory usage.
* `df.describe()` → Descriptive statistics (numeric columns only by default).
* `df.columns` → Column labels.
* `df.index` → Row index (row labels).
* `df.dtypes` → Data type of each column.
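
As a minimal sketch of the exploration calls above, the following uses a small made-up DataFrame (the column names and values are hypothetical, chosen only for illustration). Note that `shape`, `columns`, `index`, and `dtypes` are attributes, so they take no parentheses, while `head`, `tail`, `info`, and `describe` are methods.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "name": ["An", "Binh", "Chi", "Dung"],
    "age": [21, 35, 19, 42],
    "city": ["Hanoi", "Hue", None, "Da Nang"],
})

first_two = df.head(2)           # first 2 rows
last_one = df.tail(1)            # last row
n_rows, n_cols = df.shape        # attribute: no parentheses
column_names = list(df.columns)  # column labels as a plain list
summary = df.describe()          # numeric columns only by default

print(first_two)
print(n_rows, n_cols)
print(column_names)
print(df.dtypes)
df.info()                        # prints dtypes and non-null counts
```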

---

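### 2. Selecting and filtering data

* `df['col']` → Select one column (returns a `Series`).
* `df[['col1','col2']]` → Select several columns.
* `df.iloc[0:5, 0:2]` → Select by **integer position** (rows, columns); the end of the slice is excluded.
* `df.loc[0:5, ['col1','col2']]` → Select by **label**; the end of the slice is included.
* `df[df['Age'] > 20]` → Filter rows by a condition.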
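
The difference between position-based and label-based slicing is easiest to see when the index is not the default 0..n-1. The sketch below uses a hypothetical DataFrame with labels 10, 20, 30, 40: `iloc` excludes the slice stop, while `loc` includes it.

```python
import pandas as pd

# Hypothetical data; the non-default index makes the difference visible
df = pd.DataFrame(
    {"Age": [18, 25, 31, 40], "City": ["Hue", "Hanoi", "Hue", "Vinh"]},
    index=[10, 20, 30, 40],
)

by_position = df.iloc[0:2, 0:1]    # positions 0 and 1, first column; stop is excluded
by_label = df.loc[10:30, ["Age"]]  # labels 10 through 30; stop is included
adults = df[df["Age"] > 20]        # boolean mask keeps rows where the condition holds

print(by_position)
print(by_label)
print(adults)
```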

---

### 3. Handling missing data

* `df.isnull()` → Flag missing values (NaN) with `True`.
* `df.isnull().sum()` → Count missing values per column.
* `df.fillna(value)` → Replace missing values with `value`.
* `df.dropna()` → Drop rows that contain any missing value.
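
A short sketch of these calls on made-up data (the columns `a` and `b` are hypothetical). Both `fillna` and `dropna` return a new DataFrame and leave the original untouched unless the result is assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in both columns
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, None]})

missing_per_column = df.isnull().sum()          # count of missing values per column
filled = df.fillna({"a": 0.0, "b": "unknown"})  # per-column replacement values
complete_rows = df.dropna()                     # keep only rows without any missing value

print(missing_per_column)
print(filled)
print(complete_rows)
```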

---

### 4. Statistics & math

* `df.mean()`, `df.median()`, `df.mode()` → Mean, median, mode.
* `df.min()`, `df.max()` → Smallest / largest value.
* `df.sum()`, `df.cumsum()` → Sum / cumulative sum.
* `df['col'].value_counts()` → Frequency of each value in a column (`df.value_counts()` on a whole DataFrame counts distinct rows instead).
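
A minimal sketch on a hypothetical two-column DataFrame, applying the statistics to single columns so that string columns do not get in the way:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"item": ["tea", "coffee", "tea", "tea"], "qty": [1, 2, 3, 4]})

counts = df["item"].value_counts()  # frequency of each distinct value, most frequent first
mean_qty = df["qty"].mean()
median_qty = df["qty"].median()
running_total = df["qty"].cumsum()  # cumulative sum down the column

print(counts)
print(mean_qty, median_qty)
print(running_total.tolist())
```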

---

### 5. Transforming data

* `df.rename(columns={'old':'new'})` → Rename columns.
* `df.drop(columns=['col'])` → Drop columns.
* `df.sort_values(by='col', ascending=True)` → Sort rows by a column.
* `df.apply(func)` → Apply a function to each column (or to each row with `axis=1`).
* `df['col'].map(func)` → Transform each value in one column.
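
The sketch below, on hypothetical data, contrasts `DataFrame.apply` (the function receives a whole column or row) with `Series.map` (the function receives one value at a time). All of these calls return new objects; `df` itself is not modified.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_ranges = df.apply(lambda col: col.max() - col.min())      # func gets each column as a Series
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)  # axis=1: func gets each row
doubled = df["a"].map(lambda v: v * 2)                        # element-wise on one column
renamed = df.rename(columns={"a": "alpha"})                   # new DataFrame with renamed column
sorted_desc = df.sort_values(by="b", ascending=False)         # largest b first

print(col_ranges)
print(row_sums.tolist())
print(doubled.tolist())
```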

---

### 6. Grouping & aggregation

* `df.groupby('col').mean()` → Group by a column, then take the mean of each group (pass `numeric_only=True` if the DataFrame also has non-numeric columns).
* `df.groupby('col').agg({'col2':'sum', 'col3':'mean'})` → Group, then apply a different aggregation to each column.
* `df.pivot_table(values='col1', index='col2', columns='col3', aggfunc='mean')` → Build a pivot table.
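
A small sketch of the three aggregation styles, using a hypothetical sales table. Selecting the numeric column before calling `mean()` sidesteps the question of how non-numeric columns are handled.

```python
import pandas as pd

# Hypothetical sales data for illustration
df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "day": ["Mon", "Tue", "Mon", "Tue"],
    "sales": [10, 20, 30, 50],
    "visitors": [5, 7, 9, 11],
})

mean_sales = df.groupby("store")["sales"].mean()                         # one value per store
summary = df.groupby("store").agg({"sales": "sum", "visitors": "mean"})  # different function per column
pivot = df.pivot_table(values="sales", index="store", columns="day", aggfunc="mean")

print(mean_sales)
print(summary)
print(pivot)
```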

---

### 7. Combining data

* `pd.concat([df1, df2])` → Concatenate along rows (default) or columns (`axis=1`).
* `pd.merge(df1, df2, on='col')` → Join two tables on a key column (inner join by default).
* `df.join(df2)` → Join on the index (left join by default).
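
To make the three ways of combining concrete, here is a sketch with hypothetical `orders` and `customers` tables. `concat` stacks rows, `merge` matches on a key column, and `join` matches on the index.

```python
import pandas as pd

# Hypothetical tables for illustration
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 20, 10]})
customers = pd.DataFrame({"customer_id": [10, 20], "name": ["An", "Binh"]})
more_orders = pd.DataFrame({"order_id": [4], "customer_id": [20]})

stacked = pd.concat([orders, more_orders], ignore_index=True)  # append rows, renumber the index
merged = pd.merge(orders, customers, on="customer_id")         # inner join on the key column
joined = orders.set_index("customer_id").join(                 # left join on the index
    customers.set_index("customer_id")
)

print(stacked)
print(merged)
print(joined)
```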

---

### 8. Reading / writing data

* `pd.read_csv('file.csv')` → Read a CSV file.
* `df.to_csv('file.csv', index=False)` → Write to CSV without the row index.
* `pd.read_excel('file.xlsx')`, `df.to_excel('file.xlsx')` → Read / write Excel files.
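
A round trip through CSV can be sketched without touching the disk by writing to an in-memory text buffer (`io.StringIO` stands in for a real file path here, and the data is hypothetical):

```python
import io
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"item": ["tea", "coffee"], "price": [1.5, 2.0]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False: do not write the row index as a column
csv_text = buffer.getvalue()

restored = pd.read_csv(io.StringIO(csv_text))  # read the same text back

print(csv_text)
print(restored)
```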

---


...

In [5]:
import pandas as pd 
import numpy as np 


In [10]:
df = pd.read_csv("chipolet.csv", sep="\t")  # the file is tab-separated


In [26]:
df.head(5)  # return the first 5 rows of the dataset

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


In [27]:
# dataset overview: 4622 rows, 5 columns
df.shape


(4622, 5)

In [28]:
df.info()  # summary information about the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   order_id            4622 non-null   int64 
 1   quantity            4622 non-null   int64 
 2   item_name           4622 non-null   object
 3   choice_description  3376 non-null   object
 4   item_price          4622 non-null   object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB


In [29]:
list(df.columns)  # get the column names of the dataset

['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']

In [30]:
df.index  # get the row index of the dataset

RangeIndex(start=0, stop=4622, step=1)

In [31]:
df.describe()  # summarizes the DataFrame with descriptive statistics
# By default only numeric columns (here the int64 columns order_id and quantity) are summarized;
# columns of other types are left out.

Unnamed: 0,order_id,quantity
count,4622.0,4622.0
mean,927.254868,1.075725
std,528.890796,0.410186
min,1.0,1.0
25%,477.25,1.0
50%,926.0,1.0
75%,1393.0,1.0
max,1834.0,15.0


In [32]:
df.describe(include="all")
# summarize every column, including the string (object) columns

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
count,4622.0,4622.0,4622,3376,4622
unique,,,50,1043,78
top,,,Chicken Bowl,[Diet Coke],$8.75
freq,,,726,134,730
mean,927.254868,1.075725,,,
std,528.890796,0.410186,,,
min,1.0,1.0,,,
25%,477.25,1.0,,,
50%,926.0,1.0,,,
75%,1393.0,1.0,,,


### loc vs iloc 


In [22]:
df.head() 

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


In [23]:
df.tail  # without parentheses this returns the bound method itself; call df.tail() to get the last 5 rows

<bound method NDFrame.tail of       order_id  quantity                              item_name  \
0            1         1           Chips and Fresh Tomato Salsa   
1            1         1                                   Izze   
2            1         1                       Nantucket Nectar   
3            1         1  Chips and Tomatillo-Green Chili Salsa   
4            2         2                           Chicken Bowl   
...        ...       ...                                    ...   
4617      1833         1                          Steak Burrito   
4618      1833         1                          Steak Burrito   
4619      1834         1                     Chicken Salad Bowl   
4620      1834         1                     Chicken Salad Bowl   
4621      1834         1                     Chicken Salad Bowl   

                                     choice_description item_price  
0                                                   NaN     $2.39   
1                          

In [None]:
df.dtypes  # check the data type of each column

order_id               int64
quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object

In [12]:
df.head  # without parentheses this returns the bound method itself; call df.head() to get the first 5 rows

<bound method NDFrame.head of       order_id  quantity                              item_name  \
0            1         1           Chips and Fresh Tomato Salsa   
1            1         1                                   Izze   
2            1         1                       Nantucket Nectar   
3            1         1  Chips and Tomatillo-Green Chili Salsa   
4            2         2                           Chicken Bowl   
...        ...       ...                                    ...   
4617      1833         1                          Steak Burrito   
4618      1833         1                          Steak Burrito   
4619      1834         1                     Chicken Salad Bowl   
4620      1834         1                     Chicken Salad Bowl   
4621      1834         1                     Chicken Salad Bowl   

                                     choice_description item_price  
0                                                   NaN     $2.39   
1                          

In [33]:
df.loc[(df.quantity == 2) & (df.item_name == "Chips and Fresh Tomato Salsa"), ['order_id', 'quantity', 'item_name']]

Unnamed: 0,order_id,quantity,item_name
1882,759,2,Chips and Fresh Tomato Salsa
2267,912,2,Chips and Fresh Tomato Salsa
2729,1083,2,Chips and Fresh Tomato Salsa


In [42]:
df.iloc[[11]]


Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
11,6,1,Chicken Crispy Tacos,"[Roasted Chili Corn Salsa, [Fajita Vegetables,...",$8.75


In [51]:
df.iloc[[10]] 

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
10,5,1,Chips and Guacamole,,$4.45


In [53]:
df.iloc[3:11, 0:-1]  # rows at positions 3 to 10, every column except the last

Unnamed: 0,order_id,quantity,item_name,choice_description
3,1,1,Chips and Tomatillo-Green Chili Salsa,
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans..."
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou..."
6,3,1,Side of Chips,
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables..."
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch..."
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto..."
10,5,1,Chips and Guacamole,


In [65]:
df.iloc[3:11, :-1]  # same selection; the start of the column slice defaults to 0

Unnamed: 0,order_id,quantity,item_name,choice_description
3,1,1,Chips and Tomatillo-Green Chili Salsa,
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans..."
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou..."
6,3,1,Side of Chips,
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables..."
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch..."
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto..."
10,5,1,Chips and Guacamole,


In [82]:
x = df.iloc[3:11]  # rows at positions 3 to 10, all columns
print(x)


    order_id  quantity                              item_name  \
3          1         1  Chips and Tomatillo-Green Chili Salsa   
4          2         2                           Chicken Bowl   
5          3         1                           Chicken Bowl   
6          3         1                          Side of Chips   
7          4         1                          Steak Burrito   
8          4         1                       Steak Soft Tacos   
9          5         1                          Steak Burrito   
10         5         1                    Chips and Guacamole   

                                   choice_description item_price  
3                                                 NaN     $2.39   
4   [Tomatillo-Red Chili Salsa (Hot), [Black Beans...    $16.98   
5   [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...    $10.98   
6                                                 NaN     $1.69   
7   [Tomatillo Red Chili Salsa, [Fajita Vegetables...    $11.75   
8   [Tomatil

In [91]:
df.quantity.dtype

dtype('int64')

In [108]:
df.item_price.apply(lambda x: x.replace('$', ''))  # strips the "$" sign; values stay strings and the result is not assigned

0        2.39 
1        3.39 
2        3.39 
3        2.39 
4       16.98 
         ...  
4617    11.75 
4618    11.75 
4619    11.25 
4620     8.75 
4621     8.75 
Name: item_price, Length: 4622, dtype: object

In [137]:
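df["total_price"] = df["quantity"] * df["item_price"]
# creates a new column named total_price; item_price is still a string here,
# so * repeats the string quantity times instead of multiplying numbers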

In [148]:
# compute the revenue
revenue = df["total_price"].sum()  # total_price holds strings here, so sum() concatenates them
print(revenue)

$2.39 $3.39 $3.39 $2.39 $16.98 $16.98 $10.98 $1.69 $11.75 $9.25 $9.25 $4.45 $8.75 $8.75 $11.25 $4.45 $2.39 $8.49 $8.49 $2.18 $2.18 $8.75 $4.45 $8.99 $3.39 $10.98 $3.39 $2.39 $8.49 $8.99 $1.09 $8.49 $2.39 $8.99 $1.69 $8.99 $1.09 $8.75 $8.75 $4.45 $2.95 $11.75 $2.15 $4.45 $11.25 $11.75 $8.75 $10.98 $8.99 $3.39 $8.99 $3.99 $8.99 $2.18 $2.18 $10.98 $1.09 $8.99 $2.39 $9.25 $11.25 $11.75 $2.15 $4.45 $9.25 $11.25 $8.75 $8.99 $8.99 $3.39 $8.99 $10.98 $8.99 $1.69 $8.99 $3.99 $8.75 $4.45 $8.75 $8.75 $2.15 $8.75 $11.25 $2.15 $9.25 $8.75 $8.75 $9.25 $8.49 $8.99 $1.09 $9.25 $2.95 $11.75 $11.75 $9.25 $11.75 $4.45 $9.25 $4.45 $11.75 $8.75 $8.75 $4.45 $8.99 $8.99 $3.99 $8.49 $3.39 $8.99 $1.09 $9.25 $4.45 $8.75 $2.95 $4.45 $2.39 $8.49 $8.99 $8.49 $1.09 $8.99 $3.99 $8.75 $9.25 $4.45 $11.25 $4.45 $8.99 $1.09 $9.25 $2.95 $4.45 $11.75 $4.45 $8.49 $2.39 $10.98 $22.50 $22.50 $11.75 $4.45 $11.25 $4.45 $11.25 $4.45 $11.25 $11.25 $11.75 $9.25 $4.45 $11.48 $17.98 $17.98 $1.69 $17.50 $17.50 $4.45 $8.49 $2.39 $17.
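
The long string above comes from multiplying and summing string prices. A minimal sketch of a numeric version follows, using a small hypothetical sample with the same `quantity` and `item_price` layout (the three rows and the names `sample` and `price` are assumptions for illustration, not the full dataset): strip the `$`, convert to `float`, then multiply and sum.

```python
import pandas as pd

# Hypothetical sample mimicking the quantity / item_price columns
sample = pd.DataFrame({
    "quantity": [1, 2, 1],
    "item_price": ["$2.39 ", "$16.98 ", "$3.39 "],
})

price = (
    sample["item_price"]
    .str.replace("$", "", regex=False)  # treat "$" literally, not as a regex anchor
    .str.strip()                        # drop the trailing space
    .astype(float)                      # strings -> numbers
)
sample["total_price"] = sample["quantity"] * price  # numeric multiplication
revenue = sample["total_price"].sum()               # numeric sum

print(round(revenue, 2))
```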

In [143]:
df.groupby("item_price").apply(print)  # print the rows of each group that shares the same item_price

      order_id  quantity      item_name choice_description item_price  \
28          14         1    Canned Soda       [Dr. Pepper]     $1.09    
34          17         1  Bottled Water                NaN     $1.09    
53          24         1    Canned Soda           [Sprite]     $1.09    
87          38         1  Bottled Water                NaN     $1.09    
107         47         1    Canned Soda       [Dr. Pepper]     $1.09    
...        ...       ...            ...                ...        ...   
3936      1578         1    Canned Soda  [Diet Dr. Pepper]     $1.09    
4001      1602         1  Bottled Water                NaN     $1.09    
4008      1604         1    Canned Soda        [Diet Coke]     $1.09    
4051      1621         1    Canned Soda           [Sprite]     $1.09    
4069      1629         1  Bottled Water                NaN     $1.09    

     total price  total_price  
28         $1.09       $1.09   
34         $1.09       $1.09   
53         $1.09       $1.0



In [161]:
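c = df.groupby("item_name")["quantity"].sum()  # total quantity sold per item
c.sort_values(ascending=True)  # returns a new sorted Series; the result is not assigned, so c is unchanged
print(c)  # still in group-key (alphabetical item_name) order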

item_name
6 Pack Soft Drink                         55
Barbacoa Bowl                             66
Barbacoa Burrito                          91
Barbacoa Crispy Tacos                     12
Barbacoa Salad Bowl                       10
Barbacoa Soft Tacos                       25
Bottled Water                            211
Bowl                                       4
Burrito                                    6
Canned Soda                              126
Canned Soft Drink                        351
Carnitas Bowl                             71
Carnitas Burrito                          60
Carnitas Crispy Tacos                      8
Carnitas Salad                             1
Carnitas Salad Bowl                        6
Carnitas Soft Tacos                       40
Chicken Bowl                             761
Chicken Burrito                          591
Chicken Crispy Tacos                      50
Chicken Salad                              9
Chicken Salad Bowl                       123
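
The listing above is in alphabetical order because `sort_values` returns a new Series rather than sorting in place. A minimal sketch of keeping the sorted result, on a hypothetical sample (the item names and quantities are made up for illustration):

```python
import pandas as pd

# Hypothetical sample with the same item_name / quantity layout
sample = pd.DataFrame({
    "item_name": ["Bowl", "Izze", "Bowl", "Burrito", "Izze", "Bowl"],
    "quantity": [1, 2, 3, 1, 1, 2],
})

totals = sample.groupby("item_name")["quantity"].sum()  # ordered by item_name
totals_sorted = totals.sort_values(ascending=False)     # assign the result to keep the new order

print(totals_sorted.head(2))  # the two best-selling items in the sample
```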
