### Data Loading and Filtering

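#### In the exercise below, you will implement the load_data() function, which you can also use directly in your project. It involves the following four steps: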

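- **Load the data set for the specified city.** Index the global CITY_DATA dictionary to get the file name that corresponds to the given city name.
- **Create the month and day_of_week columns.** Convert the "Start Time" column to datetime, then use the column's datetime properties (the .dt accessor) to extract the month number and the weekday name into separate columns.
- **Filter by month.** Because the month argument is given as a month name, first convert it to the corresponding month number. Then select the dataframe rows for that month and assign the result back to the dataframe.
- **Filter by day of week.** Select the dataframe rows for the specified weekday name and assign the result back to the dataframe. (Note: use the title() method to capitalize the first letter of the day argument so that it matches the capitalization used in the day_of_week column.)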
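
The column-creation and filtering steps above can be tried on a tiny in-memory frame before touching the real CSV files. The following is a minimal sketch, not the exercise solution: the `trips` frame and its three timestamps are hypothetical stand-ins for a city file, and it uses `dt.day_name()`, which replaces the older `dt.weekday_name` attribute on recent pandas versions.

```python
import pandas as pd

# Hypothetical three-row frame standing in for a city CSV file.
trips = pd.DataFrame({
    "Start Time": [
        "2017-01-02 09:07:57",
        "2017-02-14 21:03:32",
        "2017-02-28 07:39:05",
    ],
})

# Parse the strings into datetime64 values so the .dt accessor is available.
trips["Start Time"] = pd.to_datetime(trips["Start Time"])

# Derive the two helper columns used for filtering.
trips["month"] = trips["Start Time"].dt.month              # 1 = January, 2 = February, ...
trips["day_of_week"] = trips["Start Time"].dt.day_name()   # 'Monday', 'Tuesday', ...

# Convert a month name to its number (list position + 1).
months = ["january", "february", "march", "april", "may", "june"]
month_number = months.index("february") + 1

# Keep only the rows that match both the month and the title-cased day name.
filtered = trips[(trips["month"] == month_number)
                 & (trips["day_of_week"] == "tuesday".title())]
print(filtered)
```

Combining both conditions in one boolean mask is only a convenience for this sketch; filtering in two separate steps gives the same rows.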

In [2]:
import pandas as pd

In [3]:
CITY_DATA = {'chicago': 'chicago.csv',
             'new york city': 'new_york_city.csv',
             'washington': 'washington.csv'}

In [4]:
def load_data(city, month, day):
    """
    Loads data for the specified city and filters by month and day if applicable.

    Args:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or 'all' to apply no month filter
        (str) day - name of the day of week to filter by, or 'all' to apply no day filter
    Returns:
        df - pandas DataFrame containing city data filtered by month and day
    """
    
    # load data file into a dataframe
    df = pd.read_csv(CITY_DATA[city])
    
    # convert the Start Time column to datetime
    df['Start Time'] = pd.to_datetime(df['Start Time'])
    
    # extract month, day of week, and hour from Start Time to create new columns
    df['month'] = df['Start Time'].dt.month
    # day_name() returns names such as 'Tuesday' (it replaces the older weekday_name attribute)
    df['day_of_week'] = df['Start Time'].dt.day_name()
    df['hour'] = df['Start Time'].dt.hour
    
    # filter by month if applicable
    if month != 'all':
        # use the index of the months list to get the corresponding int
        months = ['january', 'february', 'march', 'april', 'may', 'june']
        month = months.index(month) + 1
        
        # filter by month to create the new dataframe
        df = df[df['month'] == month]
        
    # filter by day of week if applicable
    if day != 'all':
        # filter by day of week to create the new dataframe
        df = df[df['day_of_week'] == day.title()]
        
    return df
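
As written, load_data() raises a KeyError for an unknown city and a ValueError from `months.index()` for a month name that is capitalized or not in the list. One possible way to guard against that is to normalize the arguments first. The sketch below is an assumption, not part of the exercise: `normalize_filters`, `MONTHS`, and `DAYS` are hypothetical names, and `CITY_DATA` is repeated only to keep the block self-contained.

```python
# Repeated here so the sketch runs on its own.
CITY_DATA = {'chicago': 'chicago.csv',
             'new york city': 'new_york_city.csv',
             'washington': 'washington.csv'}

# Hypothetical lookup lists; MONTHS mirrors the list used inside load_data().
MONTHS = ['january', 'february', 'march', 'april', 'may', 'june']
DAYS = ['monday', 'tuesday', 'wednesday', 'thursday',
        'friday', 'saturday', 'sunday']


def normalize_filters(city, month, day):
    """Hypothetical helper: lowercase, trim, and validate the three arguments."""
    city = city.strip().lower()
    month = month.strip().lower()
    day = day.strip().lower()

    if city not in CITY_DATA:
        raise ValueError(f"unknown city: {city!r}")
    if month != 'all' and month not in MONTHS:
        raise ValueError(f"unknown month: {month!r}")
    if day != 'all' and day not in DAYS:
        raise ValueError(f"unknown day: {day!r}")
    return city, month, day


print(normalize_filters(' Chicago', 'February', 'TUESDAY'))
```

The returned tuple can be unpacked straight into the function, for example `load_data(*normalize_filters('Chicago', 'February', 'tuesday'))`; load_data() still title-cases the day itself.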
    

In [5]:
sample = load_data('chicago', 'february', 'Tuesday')

In [6]:
sample.head()

| | Unnamed: 0 | Start Time | End Time | Trip Duration | Start Station | End Station | User Type | Gender | Birth Year | month | day_of_week | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 147 | 273417 | 2017-02-28 07:39:05 | 2017-02-28 07:45:04 | 359 | Kingsbury St & Kinzie St | Michigan Ave & Lake St | Subscriber | Male | 1991.0 | 2 | Tuesday | 7 |
| 174 | 275028 | 2017-02-28 10:03:30 | 2017-02-28 10:16:45 | 795 | Damen Ave & Augusta Blvd | Winchester Ave & Elston Ave | Subscriber | Male | 1972.0 | 2 | Tuesday | 10 |
| 243 | 175166 | 2017-02-14 21:03:32 | 2017-02-14 21:12:28 | 536 | Ellis Ave & 60th St | Kimbark Ave & 53rd St | Subscriber | Female | 1981.0 | 2 | Tuesday | 21 |
| 294 | 236346 | 2017-02-21 17:42:05 | 2017-02-21 17:50:49 | 524 | Clark St & Congress Pkwy | Michigan Ave & 18th St | Subscriber | Male | 1978.0 | 2 | Tuesday | 17 |
| 398 | 234788 | 2017-02-21 16:35:11 | 2017-02-21 16:51:10 | 959 | Clark St & Randolph St | Loomis St & Taylor St (*) | Subscriber | Male | 1984.0 | 2 | Tuesday | 16 |


In [10]:
sample['hour'].value_counts().max()

723
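
Note that `value_counts().max()` returns how many trips fall in the busiest hour, not the hour itself. A small sketch of the difference, using a hypothetical `hours` series rather than the real data:

```python
import pandas as pd

# Hypothetical start hours for seven trips.
hours = pd.Series([7, 10, 17, 17, 16, 17, 7], name="hour")

counts = hours.value_counts()          # number of trips per distinct hour

busiest_hour = counts.idxmax()         # the hour with the most trips
trips_in_busiest_hour = counts.max()   # how many trips started in that hour

print(busiest_hour, trips_in_busiest_hour)
```

For this sample the busiest hour is 17 with 3 trips; `hours.mode()[0]` would give the same hour.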

In [7]:
sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 12 columns):
Unnamed: 0       300000 non-null int64
Start Time       300000 non-null datetime64[ns]
End Time         300000 non-null object
Trip Duration    300000 non-null int64
Start Station    300000 non-null object
End Station      300000 non-null object
User Type        300000 non-null object
Gender           238948 non-null object
Birth Year       238981 non-null float64
month            300000 non-null int64
day_of_week      300000 non-null object
hour             300000 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(4), object(6)
memory usage: 27.5+ MB


In [9]:
sample['Start Station'].mode()[0]

'Streeter Dr & Grand Ave'

In [10]:
sample['End Station'].mode()[0]

'Streeter Dr & Grand Ave'
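
`mode()` returns a Series because several values can tie for the highest count, and indexing with `[0]` keeps only the first of them in sorted order. A minimal sketch with hypothetical station names:

```python
import pandas as pd

# Hypothetical stations: 'Station A' and 'Station B' each appear twice.
stations = pd.Series(["Station B", "Station A", "Station B",
                      "Station A", "Station C"])

modes = stations.mode()   # every value tied for the highest count, sorted
first_mode = modes[0]     # the first tied value after sorting

print(list(modes), first_mode)
```

If ties matter for the analysis, it may be worth reporting all of `modes` instead of only the first entry.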

In [25]:
sample.groupby(['Start Station', 'End Station']).size()

Start Station                 End Station                    
2112 W Peterson Ave           2112 W Peterson Ave                 1
                              Broadway & Granville Ave            1
                              Broadway & Thorndale Ave            3
                              Clark St & Berwyn Ave               5
                              Clark St & Bryn Mawr Ave            1
                              Clark St & Jarvis Ave               1
                              Clark St & Winnemac Ave             2
                              Lincoln Ave & Belle Plaine Ave      2
                              Maplewood Ave & Peterson Ave        1
                              Oakley Ave & Irving Park Rd         1
                              Ravenswood Ave & Balmoral Ave       2
                              Ravenswood Ave & Lawrence Ave       1
                              Sheridan Rd & Greenleaf Ave         1
                              Warren Park East        

In [27]:
common = sample.groupby(['Start Station', 'End Station']).size().idxmax()

In [29]:
sample.groupby(['Start Station', 'End Station']).size().loc[common[0], common[1]]

854
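
The last three cells find the most frequent start/end station pair and its trip count. The same idea can be sketched on a small hypothetical frame, computing the grouped counts once and reusing them; the station names and the `pair_counts` variable are illustrative only.

```python
import pandas as pd

# Hypothetical trips; the pair ('Station A', 'Station B') occurs three times.
trips = pd.DataFrame({
    "Start Station": ["Station A", "Station A", "Station B", "Station A", "Station A"],
    "End Station":   ["Station B", "Station B", "Station A", "Station C", "Station B"],
})

# Count trips per (start, end) pair; the result is a Series with a MultiIndex.
pair_counts = trips.groupby(["Start Station", "End Station"]).size()

common = pair_counts.idxmax()     # tuple: (start station, end station)
count = pair_counts.loc[common]   # same value as pair_counts.max()

print(common, count)
```

Storing the grouped counts in a variable avoids repeating the groupby for each lookup.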