## How to Store Pandas Processing Results in MySQL

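A typical data-processing pipeline:
1. Pandas reads data from several sources, such as Excel, CSV, MySQL, or web scrapers.
2. Pandas filters the data and computes statistics.
3. Pandas writes the results to MySQL, where they can back web pages or further SQL analysis.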

Official documentation:  
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql

### Data preparation: compute each movie's average rating and join it with the movie information

In [1]:
import pandas as pd

In [5]:
df_ratings = pd.read_csv(
    "./datas/movielens-1m/ratings.dat", 
    sep="::",
    engine='python', 
    names="UserID::MovieID::Rating::Timestamp".split("::")
)
df_ratings.head()

| | UserID | MovieID | Rating | Timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |


In [6]:
df_ratings = df_ratings.groupby("MovieID")["Rating"].mean().reset_index()
df_ratings.head()

| | MovieID | Rating |
|---|---|---|
| 0 | 1 | 4.146846 |
| 1 | 2 | 3.201141 |
| 2 | 3 | 3.016736 |
| 3 | 4 | 2.729412 |
| 4 | 5 | 3.006757 |


In [7]:
df_movies = pd.read_csv(
    "./datas/movielens-1m/movies.dat", 
    sep="::",
    engine='python', 
    names="MovieID::Title::Genres".split("::")
)
df_movies.head()

| | MovieID | Title | Genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [9]:
df_result = pd.merge(
    left=df_ratings, right=df_movies,
    left_on="MovieID", right_on="MovieID"
)
df_result.head()

| | MovieID | Rating | Title | Genres |
|---|---|---|---|---|
| 0 | 1 | 4.146846 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | 3.201141 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | 3.016736 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | 2.729412 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | 3.006757 | Father of the Bride Part II (1995) | Comedy |


In [10]:
df_result.set_index("MovieID", inplace=True)
df_result.head()

| MovieID | Rating | Title | Genres |
|---|---|---|---|
| 1 | 4.146846 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 2 | 3.201141 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 3 | 3.016736 | Grumpier Old Men (1995) | Comedy\|Romance |
| 4 | 2.729412 | Waiting to Exhale (1995) | Comedy\|Drama |
| 5 | 3.006757 | Father of the Bride Part II (1995) | Comedy |


In [18]:
df_result.shape

(3706, 3)
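
The preparation steps above can be tried on a few rows without the MovieLens files. The sketch below is only an illustration: the rows are made up, and the `how` and `validate` arguments are additions that the original cells do not use. It shows that an inner join keeps only the movies present in both frames, which is worth checking before persisting the result.

```python
import pandas as pd

# Toy stand-ins for ratings.dat and movies.dat (made-up rows, not MovieLens data)
ratings = pd.DataFrame({
    "UserID": [1, 2, 3, 1, 2],
    "MovieID": [10, 10, 10, 20, 30],
    "Rating": [5, 4, 3, 2, 4],
})
movies = pd.DataFrame({
    "MovieID": [10, 20],
    "Title": ["Movie A", "Movie B"],
    "Genres": ["Comedy", "Drama"],
})

# Average rating per movie, keeping MovieID as a regular column
avg = ratings.groupby("MovieID", as_index=False)["Rating"].mean()

# validate="one_to_one" raises MergeError if MovieID repeats on either side
result = pd.merge(
    avg, movies, on="MovieID", how="inner", validate="one_to_one"
).set_index("MovieID")

print(result)
```

MovieID 30 has ratings but no entry in `movies`, so the inner join drops it; `how="left"` would keep it with missing `Title` and `Genres`.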

### 创建sqlalchemy对象连接MySQL

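SQLAlchemy is an ORM (Object-Relational Mapping) framework for Python: it maps relational database tables to Python objects.

* Official site: https://www.sqlalchemy.org/
* If the sqlalchemy package is missing, install it with: pip install sqlalchemy
* The `mysql+pymysql` connection URL used below also needs the driver: pip install pymysql

It can also be used much like pymysql, to execute SQL statements directly.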

In [11]:
from sqlalchemy import create_engine

In [12]:
engine = create_engine("mysql+pymysql://root:123456@127.0.0.1:3306/test?charset=utf8", echo=False)
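
Hard-coding credentials in the URL string is fine for a local demo, but it breaks when the password contains characters such as `@` or `/`. One possible alternative, assuming SQLAlchemy 1.4 or later, is to build the URL with `URL.create`, which escapes each component. The environment variable names (`MYSQL_USER` and so on), the helper names, and the `utf8mb4` charset are assumptions made for this sketch, not part of the original notebook.

```python
import os

from sqlalchemy import create_engine
from sqlalchemy.engine import URL


def build_mysql_url() -> URL:
    """Assemble the connection URL from hypothetical environment variables."""
    return URL.create(
        drivername="mysql+pymysql",
        username=os.environ.get("MYSQL_USER", "root"),
        password=os.environ.get("MYSQL_PASSWORD", "123456"),
        host=os.environ.get("MYSQL_HOST", "127.0.0.1"),
        port=int(os.environ.get("MYSQL_PORT", "3306")),
        database=os.environ.get("MYSQL_DATABASE", "test"),
        # utf8mb4 is MySQL's full 4-byte UTF-8; plain "utf8" is the 3-byte subset
        query={"charset": "utf8mb4"},
    )


def make_engine():
    # create_engine imports the pymysql driver, so pymysql must be installed.
    # pool_pre_ping checks a pooled connection before it is handed out.
    return create_engine(build_mysql_url(), pool_pre_ping=True)


url = build_mysql_url()
# Safe to log: the password is masked
print(url.render_as_string(hide_password=True))
```

Calling `make_engine()` would then give an engine that can be passed to `to_sql` in the same way as the one created above.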

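### Method 1: overwrite the whole table on every run

With `if_exists="replace"`, each run drops the table if it exists and creates it again, so this also works when the table does not exist yet.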

In [13]:
df_result.to_sql(name='move_ratings', con=engine, if_exists="replace")


In [15]:
engine.execute("select count(1) from move_ratings").first()

(3706,)
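
`engine.execute(...)` as used in these cells was removed in SQLAlchemy 2.0; on newer versions the query has to go through a connection and `text()`. The sketch below shows that style together with `if_exists="replace"`. It makes two assumptions: an in-memory SQLite engine stands in for MySQL so the code runs without a server, and the explicit `dtype` mapping is an optional addition that the original call does not use.

```python
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.types import Float, String

# Assumption: in-memory SQLite stands in for the MySQL engine
engine = create_engine("sqlite://")

df = pd.DataFrame(
    {"Rating": [4.1, 3.2], "Title": ["Movie A", "Movie B"]},
    index=pd.Index([1, 2], name="MovieID"),
)

for _ in range(2):
    # "replace" drops and recreates the table, so repeated runs do not pile up rows
    df.to_sql(
        name="move_ratings",
        con=engine,
        if_exists="replace",
        dtype={"Rating": Float(), "Title": String(255)},  # optional column types
    )

# Connection-based querying, which works on SQLAlchemy 1.4 and 2.0
with engine.connect() as conn:
    row_count = conn.execute(text("SELECT COUNT(1) FROM move_ratings")).scalar()

print(row_count)
```

Without `dtype`, pandas infers the column types itself, and on MySQL string columns typically end up as `TEXT`.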

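### Method 2: append new rows when the table already exists

Scenario: new data arrives every day and has to be added to the existing table. How should this be handled?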

In [19]:
df_new = df_result.loc[:4, :]
df_new

| MovieID | Rating | Title | Genres |
|---|---|---|---|
| 1 | 4.146846 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 2 | 3.201141 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 3 | 3.016736 | Grumpier Old Men (1995) | Comedy\|Romance |
| 4 | 2.729412 | Waiting to Exhale (1995) | Comedy\|Drama |


In [20]:
df_new.to_sql(name='move_ratings', con=engine, if_exists="append")

In [21]:
engine.execute("SELECT * FROM move_ratings where MovieID<5 ").fetchall()

[(1, 4.14684641309581, 'Toy Story (1995)', "Animation|Children's|Comedy"),
 (1, 4.14684641309581, 'Toy Story (1995)', "Animation|Children's|Comedy"),
 (2, 3.20114122681883, 'Jumanji (1995)', "Adventure|Children's|Fantasy"),
 (2, 3.20114122681883, 'Jumanji (1995)', "Adventure|Children's|Fantasy"),
 (3, 3.01673640167364, 'Grumpier Old Men (1995)', 'Comedy|Romance'),
 (3, 3.01673640167364, 'Grumpier Old Men (1995)', 'Comedy|Romance'),
 (4, 2.72941176470588, 'Waiting to Exhale (1995)', 'Comedy|Drama'),
 (4, 2.72941176470588, 'Waiting to Exhale (1995)', 'Comedy|Drama')]
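
Every row with `MovieID < 5` now appears twice, because `to_sql` creates the table without a primary key and `if_exists="append"` inserts blindly. A grouped query is one way to spot this on a larger table. The sketch below uses an in-memory SQLite engine and made-up rows as stand-ins for the MySQL table.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Assumption: in-memory SQLite stands in for the MySQL engine
engine = create_engine("sqlite://")

df_new = pd.DataFrame(
    {"Rating": [4.1, 3.2]},
    index=pd.Index([1, 2], name="MovieID"),
)

# Appending the same frame twice mimics re-running the daily load
df_new.to_sql("move_ratings", con=engine, if_exists="append")
df_new.to_sql("move_ratings", con=engine, if_exists="append")

# Keys that occur more than once, with the number of copies
find_duplicates = text(
    "SELECT MovieID, COUNT(*) AS copies "
    "FROM move_ratings GROUP BY MovieID HAVING COUNT(*) > 1 "
    "ORDER BY MovieID"
)
with engine.connect() as conn:
    duplicates = [tuple(row) for row in conn.execute(find_duplicates)]

print(duplicates)
```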

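#### Fix: delete the old rows by key first

Because `if_exists="append"` never checks for existing keys, appending rows that are already in the table duplicates them. Deleting the rows with the same `MovieID` before appending avoids this.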

In [22]:
df_new.index

Int64Index([1, 2, 3, 4], dtype='int64', name='MovieID')

In [23]:
for movie_id in df_new.index:
    # Delete the existing rows first for the keys about to be inserted
    delete_sql = f"delete from move_ratings where MovieID={movie_id}"
    print(delete_sql)
    engine.execute(delete_sql)

delete from move_ratings where MovieID=1
delete from move_ratings where MovieID=2
delete from move_ratings where MovieID=3
delete from move_ratings where MovieID=4
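
The loop above sends one `DELETE` per key and builds the SQL with an f-string, which is acceptable for integer IDs coming from a DataFrame index. A possible variant is a single parameterized statement: SQLAlchemy's expanding bind parameter turns a Python list into an `IN (...)` clause. In this sketch, in-memory SQLite replaces MySQL, and the rows and the `ids_to_delete` name are made up.

```python
import pandas as pd
from sqlalchemy import bindparam, create_engine, text

# Assumption: in-memory SQLite stands in for the MySQL engine
engine = create_engine("sqlite://")

pd.DataFrame(
    {"Rating": [4.1, 3.2, 3.0]},
    index=pd.Index([1, 2, 3], name="MovieID"),
).to_sql("move_ratings", con=engine, if_exists="replace")

ids_to_delete = [1, 2]  # in the notebook this would be df_new.index.tolist()

# An "expanding" bind parameter renders as one placeholder per list element
delete_stmt = text("DELETE FROM move_ratings WHERE MovieID IN :ids").bindparams(
    bindparam("ids", expanding=True)
)

with engine.begin() as conn:  # commits on success, rolls back on error
    deleted = conn.execute(delete_stmt, {"ids": ids_to_delete}).rowcount
    remaining = conn.execute(text("SELECT COUNT(1) FROM move_ratings")).scalar()

print(deleted, remaining)
```

Binding the values instead of formatting them into the string also keeps the statement safe if the keys ever become strings.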


In [24]:
engine.execute("SELECT * FROM move_ratings where MovieID<5 ").fetchall()

[]

In [25]:
engine.execute("select count(1) from move_ratings").first()

(3702,)

In [26]:
# Append the new rows to the table
df_new.to_sql(name='move_ratings', con=engine, if_exists="append")

In [27]:
engine.execute("SELECT * FROM move_ratings where MovieID<5 ").fetchall()

[(1, 4.14684641309581, 'Toy Story (1995)', "Animation|Children's|Comedy"),
 (2, 3.20114122681883, 'Jumanji (1995)', "Adventure|Children's|Fantasy"),
 (3, 3.01673640167364, 'Grumpier Old Men (1995)', 'Comedy|Romance'),
 (4, 2.72941176470588, 'Waiting to Exhale (1995)', 'Comedy|Drama')]

In [28]:
engine.execute("SELECT count(1) FROM move_ratings").first()

(3706,)
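
In the cells above, the delete and the append run as separate steps, so a failure in between would leave the table without the deleted rows. One way to reduce that risk is to run both steps in a single transaction. The helper below is a hypothetical sketch, not part of pandas or SQLAlchemy: `replace_rows_by_index` is an invented name, and in-memory SQLite again stands in for MySQL. On MySQL this relies on a transactional storage engine such as InnoDB.

```python
import pandas as pd
from sqlalchemy import bindparam, create_engine, text


def replace_rows_by_index(df: pd.DataFrame, table: str, engine) -> None:
    """Delete rows whose key matches df's index, then append df, in one transaction."""
    key = df.index.name
    # Table and column names cannot be bound as parameters, so `table` and `key`
    # must come from trusted code, never from user input.
    delete_stmt = text(f"DELETE FROM {table} WHERE {key} IN :ids").bindparams(
        bindparam("ids", expanding=True)
    )
    with engine.begin() as conn:  # both steps commit together or roll back together
        conn.execute(delete_stmt, {"ids": df.index.tolist()})
        df.to_sql(name=table, con=conn, if_exists="append")


# Assumption: in-memory SQLite stands in for the MySQL engine
engine = create_engine("sqlite://")
df_all = pd.DataFrame(
    {"Rating": [4.1, 3.2, 3.0]},
    index=pd.Index([1, 2, 3], name="MovieID"),
)
df_all.to_sql("move_ratings", con=engine, if_exists="replace")

df_new = df_all.loc[:2, :]
replace_rows_by_index(df_new, "move_ratings", engine)
replace_rows_by_index(df_new, "move_ratings", engine)  # re-running adds no duplicates

with engine.connect() as conn:
    query = text("SELECT MovieID FROM move_ratings ORDER BY MovieID")
    ids = [row[0] for row in conn.execute(query)]

print(ids)
```

Passing the open connection as `con=conn` is what lets `to_sql` take part in the same transaction as the `DELETE`.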