# Exploratory Data Analysis and Cleaning

Our goal is to get to know the dataset and do a basic cleanup: removing NaN values and duplicate rows.

# 1. Import libraries

In [29]:
import pandas as pd
import numpy as np

# 2. Read the data

In [30]:
df = pd.read_csv('../data/raw_data.csv', low_memory=False,compression='gzip')


In [31]:
len(df)

8381556

In [32]:
df.head()

Unnamed: 0,ts,number,pick_lat,pick_lng,drop_lat,drop_lng
0,2020-03-26 07:07:17,14626,12.313621,76.658195,12.287301,76.60228
1,2020-03-26 07:32:27,85490,12.943947,77.560745,12.954014,77.54377
2,2020-03-26 07:36:44,5408,12.899603,77.5873,12.93478,77.56995
3,2020-03-26 07:38:00,58940,12.918229,77.607544,12.968971,77.636375
4,2020-03-26 07:39:29,5408,12.89949,77.58727,12.93478,77.56995


### First, let's clarify what each column means:
* ts: timestamp
* number: user ID
* pick_lat: pickup latitude
* pick_lng: pickup longitude
* drop_lat: drop-off latitude
* drop_lng: drop-off longitude


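**Keep in mind that each user ID should appear at most once at any given timestamp**, so rows that repeat the same `ts` and `number` are duplicates and can be dropped.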
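The rule above can be illustrated on a toy table. This is a minimal sketch with made-up rows (the `toy` frame is hypothetical, not taken from the dataset); it only shows how `duplicated(keep=False)` differs from `drop_duplicates(keep='last')` on the `ts` and `number` columns.

```python
import pandas as pd

# Hypothetical toy rows: user 16795 appears twice at the same timestamp.
toy = pd.DataFrame({
    "ts": [
        "2020-03-26 18:10:35",
        "2020-03-26 18:10:35",
        "2020-03-26 18:10:35",
        "2020-03-26 18:11:00",
    ],
    "number": [16795, 16795, 27554, 16795],
})

# keep=False flags every row that belongs to a duplicate group,
# which is useful for inspecting the duplicates before removing them.
flagged = toy.duplicated(subset=["ts", "number"], keep=False)

# keep='last' keeps exactly one row per (ts, number) pair: the last one.
deduped = (
    toy.drop_duplicates(subset=["ts", "number"], keep="last")
       .reset_index(drop=True)
)

print(flagged.sum(), "rows flagged;", len(toy) - len(deduped), "rows removed")
```

Note that the number of flagged rows is larger than the number of rows actually removed, because one row from each duplicate group survives.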

## 3. Remove duplicates and null values

In [33]:
df[df.duplicated(subset=['ts','number'],keep=False)]

Unnamed: 0,ts,number,pick_lat,pick_lng,drop_lat,drop_lng
235,2020-03-26 18:10:35,16795,12.967236,77.641594,13.014504,77.650856
236,2020-03-26 18:10:35,16795,12.967236,77.641594,13.014504,77.650856
407,2020-03-26 21:35:50,65856,12.917173,77.586400,12.913940,77.685280
408,2020-03-26 21:35:50,65856,12.917173,77.586400,12.913940,77.685280
443,2020-03-26 23:26:29,27554,12.933715,77.619300,12.938208,77.587520
...,...,...,...,...,...,...
8381231,2021-03-26 22:23:12,61636,12.975229,77.620370,13.017285,77.618200
8381245,2021-03-26 22:25:13,61636,12.975229,77.620370,13.017285,77.618200
8381246,2021-03-26 22:25:13,61636,12.975229,77.620370,13.017285,77.618200
8381248,2021-03-26 22:25:27,61636,12.975229,77.620370,13.017285,77.618200


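**In total, 113,540 rows share their `ts` and `number` values with at least one other row, so the dataset does contain duplicates.**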

In [34]:
df.drop_duplicates(subset=['ts','number'],inplace=True,keep='last')

df.reset_index(inplace=True,drop=True)

In [35]:
# Inspect the DataFrame summary
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8315498 entries, 0 to 8315497
Data columns (total 6 columns):
 #   Column    Dtype  
---  ------    -----  
 0   ts        object 
 1   number    object 
 2   pick_lat  float64
 3   pick_lng  float64
 4   drop_lat  float64
 5   drop_lng  float64
dtypes: float64(4), object(2)
memory usage: 380.7+ MB


### Handle null values

In [36]:
np.count_nonzero(df.isnull().values)

0

In [37]:
df['number'] = pd.to_numeric(df.number,errors='coerce')

np.count_nonzero(df.isnull().values)

116

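#### **After the conversion, 116 values turn out to be null**: these are `number` entries that could not be parsed as numbers. We drop them.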

In [38]:
df.dropna(inplace=True)
len(df)

8315382
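The convert-then-drop step above can be sketched on toy data. The values below are invented for illustration, and looking at the offending strings before dropping them is an optional extra step, not something the notebook itself does.

```python
import pandas as pd

# Hypothetical toy column: one entry ("abc") is not a valid number.
toy = pd.DataFrame({"number": ["14626", "85490", "abc", "5408"]})

# errors='coerce' turns anything unparseable into NaN instead of raising.
converted = pd.to_numeric(toy["number"], errors="coerce")

# Optional: see which original values failed to convert.
bad_values = toy.loc[converted.isna() & toy["number"].notna(), "number"]
print(bad_values.tolist())

# Replace the column and drop the rows that became NaN.
toy["number"] = converted
clean = toy.dropna().reset_index(drop=True)
```

Because NaN is a floating-point value, the converted column is `float64`; once the NaN rows are gone it could be cast to an integer type if that is preferred.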

## 4. Process the time variable ts

In [40]:
df['ts'] = pd.to_datetime(df.ts)

In [41]:
df['mins'] = df['ts'].dt.minute
df['hour'] = df['ts'].dt.hour
df['day'] = df['ts'].dt.day
df['month'] = df['ts'].dt.month
df['year'] = df['ts'].dt.year
df['dayofweek'] = df['ts'].dt.dayofweek
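As a quick check of what the `.dt` accessors return, here is a small sketch on two timestamps taken from the rows shown earlier. The `features` frame is only for illustration. In pandas, `dayofweek` counts from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# Two timestamps that appear in the outputs above.
ts = pd.to_datetime(pd.Series(["2020-03-26 07:07:17", "2021-03-26 22:25:27"]))

# Same component extraction as in the cell above, on a tiny Series.
features = pd.DataFrame({
    "mins": ts.dt.minute,
    "hour": ts.dt.hour,
    "day": ts.dt.day,
    "month": ts.dt.month,
    "year": ts.dt.year,
    "dayofweek": ts.dt.dayofweek,  # Monday = 0, ..., Sunday = 6
})
print(features)
```

Here 2020-03-26 is a Thursday (`dayofweek` 3) and 2021-03-26 is a Friday (`dayofweek` 4).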

## 5. Export the data

In [42]:
df.to_csv("../data/preprocessed_1.csv",index=False,compression='gzip')
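When the exported file is loaded again, two details are worth keeping in mind. The sketch below uses a temporary file and a one-row toy frame (both hypothetical) to show them: the file name ends in `.csv` even though the content is gzip-compressed, so `compression='gzip'` has to be passed explicitly on read, and `ts` is no longer a datetime column unless `parse_dates` is used.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row frame standing in for the preprocessed data.
toy = pd.DataFrame({
    "ts": pd.to_datetime(["2020-03-26 07:07:17"]),
    "number": [14626.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")  # .csv name, gzip content, as above
    toy.to_csv(path, index=False, compression="gzip")

    # Without parse_dates, ts is read back as text.
    plain = pd.read_csv(path, compression="gzip")

    # With parse_dates, ts is restored as a datetime column.
    parsed = pd.read_csv(path, compression="gzip", parse_dates=["ts"])

print(plain["ts"].dtype, parsed["ts"].dtype)
```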