### Loading and Filtering Data

#### In the exercise below, you will implement the load_data() function, which you can also use directly in your project. There are four steps:

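- **Load the dataset for the specified city.** Index the global CITY_DATA dictionary to get the filename that corresponds to the given city name.
- **Create the month and day_of_week columns.** Convert the "Start Time" column to datetime, then use its datetime accessor to extract the month number and the weekday name into separate columns.
- **Filter by month.** Because the month argument is given as a month name, first convert it to the corresponding month number. Then select the dataframe rows for that month and reassign the result to the dataframe.
- **Filter by day of week.** Select the dataframe rows for the specified weekday name and reassign the result to the dataframe. (Note: use the title() method to capitalize the first letter of the day argument so that it matches the capitalization used in the day_of_week column.)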
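
The last three steps can be tried without any CSV file. The following is a minimal sketch on hypothetical toy data (the `toy` frame and its four timestamps are made up for illustration and are not part of the bikeshare dataset); it assumes pandas 0.23 or later, where `.dt.day_name()` is available.

```python
import pandas as pd

# Hypothetical toy data standing in for one city's CSV file.
toy = pd.DataFrame({
    "Start Time": ["2017-01-03 08:00:00", "2017-02-14 21:03:32",
                   "2017-02-15 09:30:00", "2017-02-28 07:39:05"],
})

# Step 2: convert to datetime, then derive the month number and weekday name.
toy["Start Time"] = pd.to_datetime(toy["Start Time"])
toy["month"] = toy["Start Time"].dt.month
toy["day_of_week"] = toy["Start Time"].dt.day_name()

# Step 3: translate the month name into its number, then filter on it.
months = ["january", "february", "march", "april", "may", "june"]
month_number = months.index("february") + 1
toy = toy[toy["month"] == month_number]

# Step 4: title-case the day argument so it matches the day_of_week values.
filtered = toy[toy["day_of_week"] == "tuesday".title()]
print(filtered)
```

Only the two February Tuesdays remain after both filters; the January Tuesday is dropped by the month filter and the February Wednesday by the day filter.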

In [13]:
import pandas as pd

In [14]:
CITY_DATA = {'chicago': 'chicago.csv',
             'new york city': 'new_york_city.csv',
             'washington': 'washington.csv'}

In [15]:
def load_data(city, month, day):
    """
    
    Loads data for the specified city and filters 
    by month and day if applicable.
    
    Args:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or 'all' to apply no month filter
        (str) day - name of the day of week to filter by, or 'all' to apply no day filter
    Returns:
        df - pandas DataFrame containing city data filtered by month and day
    """
    
    # load data file into a dataframe
    df = pd.read_csv(CITY_DATA[city])
    
    # convert the Start Time column to datetime
    df['Start Time'] = pd.to_datetime(df['Start Time'])
    
    # extract month, day of week, and hour from Start Time to create new columns
    df['month'] = df['Start Time'].dt.month
    df['day_of_week'] = df['Start Time'].dt.day_name()
    df['hour'] = df['Start Time'].dt.hour
    
    # filter by month if applicable
    if month != 'all':
        # use the index of the months list to get the corresponding int
        months = ['january', 'february', 'march', 'april', 'may', 'june']
        month = months.index(month) + 1
        
        # filter by month to create the new dataframe
        df = df[df['month'] == month]
        
    # filter by day of week if applicable
    if day != 'all':
        # filter by day of week to create the new dataframe
        df = df[df['day_of_week'] == day.title()]
        
    return df
    

In [16]:
sample = load_data('chicago','february','Tuesday')

In [17]:
sample.head()

| Unnamed: 0.1 | Unnamed: 0 | Start Time | End Time | Trip Duration | Start Station | End Station | User Type | Gender | Birth Year | month | day_of_week | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 147 | 273417 | 2017-02-28 07:39:05 | 2017-02-28 07:45:04 | 359 | Kingsbury St & Kinzie St | Michigan Ave & Lake St | Subscriber | Male | 1991.0 | 2 | Tuesday | 7 |
| 174 | 275028 | 2017-02-28 10:03:30 | 2017-02-28 10:16:45 | 795 | Damen Ave & Augusta Blvd | Winchester Ave & Elston Ave | Subscriber | Male | 1972.0 | 2 | Tuesday | 10 |
| 243 | 175166 | 2017-02-14 21:03:32 | 2017-02-14 21:12:28 | 536 | Ellis Ave & 60th St | Kimbark Ave & 53rd St | Subscriber | Female | 1981.0 | 2 | Tuesday | 21 |
| 294 | 236346 | 2017-02-21 17:42:05 | 2017-02-21 17:50:49 | 524 | Clark St & Congress Pkwy | Michigan Ave & 18th St | Subscriber | Male | 1978.0 | 2 | Tuesday | 17 |
| 398 | 234788 | 2017-02-21 16:35:11 | 2017-02-21 16:51:10 | 959 | Clark St & Randolph St | Loomis St & Taylor St (*) | Subscriber | Male | 1984.0 | 2 | Tuesday | 16 |


In [18]:
sample['hour'].value_counts().max()

723
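
Note that `value_counts().max()` returns how many trips fall in the busiest hour, not the hour itself. As a minimal sketch on hypothetical toy data (the `hours` Series is made up for illustration), both values could be read from the same counts:

```python
import pandas as pd

# Hypothetical start hours for six trips.
hours = pd.Series([7, 17, 17, 8, 17, 8], name="hour")

counts = hours.value_counts()    # index: hour, values: number of trips
busiest_hour = counts.idxmax()   # the hour label with the highest count
busiest_count = counts.max()     # the count itself

print(busiest_hour, busiest_count)
```

Here `idxmax()` gives the index label (the hour) and `max()` gives the value (the trip count).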

In [19]:
sample['Trip Duration'].value_counts().sum()

4911
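
Summing `value_counts()` gives the number of non-null entries in the column, which is why the result equals the row count reported by `info()`; it is not the total trip duration. A small sketch on hypothetical values (the `durations` Series, including its missing entry, is an assumption for illustration) shows the difference:

```python
import pandas as pd

# Hypothetical trip durations in seconds, with one missing value.
durations = pd.Series([359, 795, None, 359], name="Trip Duration")

non_null_rows = durations.value_counts().sum()  # counts entries, skipping NaN
same_count = durations.count()                  # equivalent and more direct
all_rows = len(durations)                       # includes the NaN entry
total_seconds = durations.sum()                 # actual total duration

print(non_null_rows, same_count, all_rows, total_seconds)
```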

In [20]:
sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4911 entries, 147 to 299904
Data columns (total 12 columns):
Unnamed: 0       4911 non-null int64
Start Time       4911 non-null datetime64[ns]
End Time         4911 non-null object
Trip Duration    4911 non-null int64
Start Station    4911 non-null object
End Station      4911 non-null object
User Type        4911 non-null object
Gender           4655 non-null object
Birth Year       4655 non-null float64
month            4911 non-null int64
day_of_week      4911 non-null object
hour             4911 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(4), object(6)
memory usage: 498.8+ KB


In [62]:
sample.sort_values('Birth Year',ascending=False).iloc[0]['Birth Year']

2000.0
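
Sorting the whole frame just to read the top value works, but the same result can usually be obtained with `Series.max()`, which skips NaN by default. A minimal sketch, assuming a hypothetical `birth_years` Series with one missing entry:

```python
import pandas as pd

# Hypothetical birth years; None becomes NaN.
birth_years = pd.Series([1991.0, None, 2000.0, 1972.0], name="Birth Year")

# Sort-based approach: NaN values are placed last by default.
latest_by_sort = birth_years.sort_values(ascending=False).iloc[0]

# Direct approach: max() ignores NaN unless skipna=False is passed.
latest_by_max = birth_years.max()

print(latest_by_sort, latest_by_max)
```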

In [58]:
sample['Birth Year'].sort_values(ascending=False)

19338     2000.0
142996    2000.0
29228     2000.0
154692    1999.0
1027      1999.0
118570    1999.0
31861     1999.0
270771    1998.0
72535     1998.0
24357     1998.0
119884    1998.0
122311    1998.0
183411    1998.0
279070    1998.0
258722    1998.0
254239    1998.0
235958    1997.0
287521    1997.0
235302    1997.0
83228     1997.0
274619    1997.0
67605     1997.0
61283     1997.0
56336     1997.0
174296    1997.0
59670     1997.0
184743    1997.0
183772    1997.0
92239     1997.0
272585    1997.0
           ...  
264095       NaN
266045       NaN
267087       NaN
269393       NaN
271053       NaN
271855       NaN
274497       NaN
275337       NaN
275383       NaN
276715       NaN
277814       NaN
279823       NaN
280753       NaN
281728       NaN
281947       NaN
282115       NaN
284547       NaN
285232       NaN
287894       NaN
290581       NaN
291928       NaN
293614       NaN
293935       NaN
293977       NaN
294277       NaN
295660       NaN
296156       NaN
297054       N

In [21]:
sample['Start Station'].mode()[0]

'Clinton St & Washington Blvd'

In [22]:
sample['End Station'].mode()[0]

'Clinton St & Washington Blvd'
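
`mode()` always returns a Series because several values can tie for most frequent, so `[0]` picks just one of them. The following sketch uses hypothetical station names (not taken from the dataset) to show a tie:

```python
import pandas as pd

# Hypothetical stations: "A St" and "B St" each appear twice.
stations = pd.Series(["B St", "A St", "B St", "A St", "C St"])

modes = stations.mode()   # a Series holding every tied value
first_mode = modes[0]     # what the cells above report

print(list(modes), first_mode)
```

When a tie matters for the analysis, it may be worth inspecting the full `mode()` result rather than only its first element.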

In [23]:
sample.groupby(['Start Station','End Station']).size()

Start Station                 End Station                        
2112 W Peterson Ave           Broadway & Granville Ave               1
900 W Harrison St             Financial Pl & Congress Pkwy           2
                              Green St & Madison St                  1
                              Halsted St & Polk St                   1
                              Morgan St & 18th St                    1
                              Morgan St & Lake St                    2
                              Western Ave & Congress Pkwy            1
                              Wood St & Division St                  1
Aberdeen St & Jackson Blvd    Aberdeen St & Monroe St                1
                              Ashland Ave & Chicago Ave              1
                              Damen Ave & Chicago Ave                1
                              Financial Pl & Congress Pkwy           2
                              Franklin St & Jackson Blvd             1
           

In [32]:
sample['Start Station'].value_counts().max()

122

In [29]:
common = sample.groupby(['Start Station','End Station']).size()

In [30]:
common

Start Station                 End Station                        
2112 W Peterson Ave           Broadway & Granville Ave               1
900 W Harrison St             Financial Pl & Congress Pkwy           2
                              Green St & Madison St                  1
                              Halsted St & Polk St                   1
                              Morgan St & 18th St                    1
                              Morgan St & Lake St                    2
                              Western Ave & Congress Pkwy            1
                              Wood St & Division St                  1
Aberdeen St & Jackson Blvd    Aberdeen St & Monroe St                1
                              Ashland Ave & Chicago Ave              1
                              Damen Ave & Chicago Ave                1
                              Financial Pl & Congress Pkwy           2
                              Franklin St & Jackson Blvd             1
           

In [25]:
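# Note: here `common` must hold a (start station, end station) pair of labels,
# for example the result of idxmax() on the grouped sizes; the size Series
# assigned in In [29] would not work as an indexer.
sample.groupby(['Start Station','End Station']).size().loc[common[0],common[1]]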

9
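
One way to find the most frequent start/end combination together with its count is `idxmax()` on the grouped sizes. This is a minimal sketch on hypothetical toy trips (the `trips` frame and its single-letter station names are made up for illustration), not necessarily how the value above was produced:

```python
import pandas as pd

# Hypothetical trips between three stations.
trips = pd.DataFrame({
    "Start Station": ["A", "A", "B", "A", "B", "A"],
    "End Station":   ["B", "B", "A", "C", "A", "B"],
})

# Number of trips for each (start, end) pair, indexed by a MultiIndex.
pair_counts = trips.groupby(["Start Station", "End Station"]).size()

# idxmax() returns the (start, end) label tuple of the largest count.
start, end = pair_counts.idxmax()
trip_count = pair_counts.loc[(start, end)]

print(start, end, trip_count)
```

Because `idxmax()` returns labels rather than positions, the resulting tuple can be passed straight back to `.loc` to look up the count.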