## What Is Data Science?

> This is a book about doing data science with Python, which immediately raises the question: what is *data science*?
It's a surprisingly hard definition to nail down, especially given how ubiquitous the term has become.
Vocal critics have variously dismissed the term as a superfluous label (after all, what science doesn't involve data?) or a simple buzzword that only exists to salt resumes and catch the eye of overzealous tech recruiters.

這是一本介紹使用Python完成數據科學工作的書，那麼立刻就會帶來一個問題：什麼是數據科學？這是一個十分難以定義的概念，尤其是這幾年這個術語幾乎隨處可見。批評的聲音認為這是一個多餘的標籤（畢竟，哪樣科學不包含數據呢？）或者這只是一個為了博取關注而產生的流行詞彙。
> In my mind, these critiques miss something important.
Data science, despite its hype-laden veneer, is perhaps the best label we have for the cross-disciplinary set of skills that are becoming increasingly important in many applications across industry and academia.
This cross-disciplinary piece is key: in my mind, the best existing definition of data science is illustrated by Drew Conway's Data Science Venn Diagram, first published on his blog in September 2010:

這些批評忽略了一些重要的東西。數據科學除了部分炒作的成分外，可能是目前我們能夠找到的最合適的詞彙來表達這種跨學科領域的技術了，特別是越來越多的工業和學術應用都在使用它。這裡的關鍵是跨學科領域：最好表達數據科學的定義的方式是2010年 Drew Conway在他的Blog 發表的這張圖：

![Data Science Venn Diagram](https://raw.githubusercontent.com/wangyingsm/Python-Data-Science-Handbook/7fac9497d573c8f3ea3545b7fcb0a98d59e1c9cb/notebooks/figures/Data_Science_VD.png)

<small>(Source: [Drew Conway](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram). Used by permission.)</small>

> While some of the intersection labels are a bit tongue-in-cheek, this diagram captures the essence of what I think people mean when they say "data science": it is fundamentally an *interdisciplinary* subject.
Data science comprises three distinct and overlapping areas: the skills of a *statistician* who knows how to model and summarize datasets (which are growing ever larger); the skills of a *computer scientist* who can design and use algorithms to efficiently store, process, and visualize this data; and the *domain expertise*—what we might think of as "classical" training in a subject—necessary both to formulate the right questions and to put their answers in context.

雖然圖中，圓形重疊部分的標籤看起來很有些嘲諷的意味，但這張圖把握了當人們使用“數據科學”這個術語時候的精髓：最根本來說，數據科學是一門交叉學科。數據科學有三個領域交叉而成：
- 需要統計學家來對數據集（正在變得越來越巨大）進行建模和統計；
- 需要計算學家來使用算法有效地存儲、處理和展現這些數據；
- 需要領域專家（通常在傳統意義上我們就是這麼做的）來在相關垂直領域整理出正確的問題和相應的解決方法。

> With this in mind, I would encourage you to think of data science not as a new domain of knowledge to learn, but a new set of skills that you can apply within your current area of expertise.
Whether you are reporting election results, forecasting stock returns, optimizing online ad clicks, identifying microorganisms in microscope photos, seeking new classes of astronomical objects, or working with data in any other field, the goal of this book is to give you the ability to ask and answer new questions about your chosen subject area.

根據上述解釋，讀者與其將數據科學當成是一個新的知識領域來學習，還不如將你已有的專業知識融會貫通，發展出新的數據科學技巧。無論你是在統計選舉結果、預測股市回報、優化在線廣告點擊、在顯微鏡圖像中識別微小組織、尋找一類新的天文物體、或者是其他任何與數據相關的工作，就是為你提供一種新的能力來提出和解答該領域的相關問題。

## Why Python?

> Python has emerged over the last couple of decades as a first-class tool for scientific computing tasks, including the analysis and visualization of large datasets.
This may have come as a surprise to early proponents of the Python language: the language itself was not specifically designed with data analysis or scientific computing in mind.
The usefulness of Python for data science stems primarily from the large and active ecosystem of third-party packages: *NumPy* for manipulation of homogeneous array-based data, *Pandas* for manipulation of heterogeneous and labeled data, *SciPy* for common scientific computing tasks, *Matplotlib* for publication-quality visualizations, *IPython* for interactive execution and sharing of code, *Scikit-Learn* for machine learning, and many more tools that will be mentioned in the following pages.

Python在最近20年已經發展成為科學計算包括分析和展示大型數據集的最流行工具。這對於Python語言的早期支持者來說是一個驚喜：因為這門語言本身並不是專門為了數據分析和科學計算來設計的。 Python在數據科學中的蓬勃發展主要來源於其大量活躍的第三方包：Numpy用於處理同類的數組結構數據；Pandas用於處理不同種類和標籤化的數據；SciPy用於通用的科學運算任務；Matplotlib用於可打印標準的圖表展示；IPython用於交互式執行和共享代碼；Scikit-Learn用於機器學習，這些工具將在後續的章節中介紹。

## A Python Integer Is More Than Just an Integer

> The standard Python implementation is written in C.
This means that every Python object is simply a cleverly-disguised C structure, which contains not only its value, but other information as well. For example, when we define an integer in Python, such as ``x = 10000``, ``x`` is not just a "raw" integer. It's actually a pointer to a compound C structure, which contains several values.
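Looking through the Python 3.4 source code, we find that the integer (long) type definition effectively looks like the struct shown below (once the C macros are expanded). Besides the digits that hold the value itself (``ob_digit``), the struct carries a reference count (``ob_refcnt``), a pointer to the type (``ob_type``), and a size field (``ob_size``). This means that there is some overhead in storing an integer in Python as compared to an integer in a compiled language like C.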

標準的Python實現是使用C語言編寫的。這意味著每個Python當中的對像都是一個偽裝良好的C結構體，結構體內不僅僅包括它的值，還有其他的信息。例如，當我們在Python中定義了一個整數，比方說'x=10000'，x不僅僅是一個原始的整數，它在底層實際上是一個指向複雜C結構體的指針，裡面含有若干個字段。當你查閱Python 3.4的源代碼的時候，你會發現整數（實際上是長整形）的定義如下（我們將C語言中的宏定義展開後），這意味著在Python中存儲一個整數要比在像C這樣的編譯語言中存儲一個整數要有損耗，就像下圖展示的那樣：
```C
struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
```
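The overhead can be observed from Python itself by comparing the size of an integer object with the size of a bare C `long`. This is a minimal sketch that assumes CPython; the exact byte counts depend on the interpreter version and platform, and the variable names are only illustrative.

```python
import ctypes
import sys

x = 10000

# Size in bytes of the whole Python integer object (header fields plus digits).
python_int_size = sys.getsizeof(x)

# Size in bytes of a bare C long on this platform, with no object header.
c_long_size = ctypes.sizeof(ctypes.c_long)

print(f"Python int object: {python_int_size} bytes")
print(f"C long:            {c_long_size} bytes")
```

On a typical CPython build the first number is several times larger than the second, which reflects the extra bookkeeping fields in the struct above.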

## Outline

> Each chapter of this book focuses on a particular package or tool that contributes a fundamental piece of the Python Data Science story.
> 1. IPython and Jupyter: these packages provide the computational environment in which many Python-using data scientists work.
> 2. NumPy: this library provides the ``ndarray`` for efficient storage and manipulation of dense data arrays in Python.
> 3. Pandas: this library provides the ``DataFrame`` for efficient storage and manipulation of labeled/columnar data in Python.
> 4. Matplotlib: this library provides capabilities for a flexible range of data visualizations in Python.
> 5. Scikit-Learn: this library provides efficient and clean Python implementations of the most important and established machine learning algorithms.

本書的每一章都聚焦於一個特定的包或工具，它對數據科學某個方面都有重要的應用和幫助。

1. IPython 和 Jupyter: 這兩個包提供了使用Python的數據科學家最喜愛的計算環境。
2. NumPy: 這個包提供了ndarray對像用於有效的存儲和處理數組中的非稀疏數據。
3. Pandas: 這個包提供了DataFrame對像用於有效的存儲和處理標籤化的基於列結構的數據。
4. Matplotlib: 這個包提供了最靈活的數據圖表展示功能。
5. Scikit-Learn: 這個包提供了很多重要的機器學習算法以及有效和簡潔的Python實現。

## Using Code Examples

> Supplemental material (code examples, figures, etc.) is available for download at http://github.com/jakevdp/PythonDataScienceHandbook/. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

本書附帶的資源（代碼示例，圖表等）可以在 http://github.com/wangyingsm/Python-Data-Science-Handbook/ 下載。本書的代碼例子是為了幫助你理解內容。在通常意義下，本書附帶的代碼可以被使用在你的程序和文檔中。你不需要聯繫作者獲得授權，除非你在修改或重構代碼非常重要的部分。例如，使用本書的代碼編寫你的程序不需要獲得作者授權；銷售和分發本書的代碼不需要獲得作者的授權；引用本書或書中的代碼例子回答問題不需要獲得作者的授權。將本書大部分的代碼例子組織在你產品的文檔中確實需要獲得作者的授權。

> We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example:
> *The Python Data Science Handbook* by Jake VanderPlas (O’Reilly). Copyright 2016 Jake VanderPlas, 978-1-491-91205-8.
> If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

雖然不是必須的，但是如果你在引用時聲明了標題、作者、出版社和ISBN的話，作者會很感激。如果你認為你對於代碼例子的使用超出了上述的授權範圍，請聯繫 permissions@oreilly.com。

### Beyond tab completion: wildcard matching (Tab)

> Tab completion is useful if you know the first few characters of the object or attribute you're looking for, but is of little help if you'd like to match characters at the middle or end of the word.
For this use-case, IPython provides a means of wildcard matching for names using the ``*`` character.
For example, we can use this to list every object in the namespace whose name ends with ``Warning``:

Tab 對於你知道對像或屬性的頭幾個字母的情況下非常有效，但是如果你只記得中間或末尾處的字符時，tab 就無法發揮了。對於這種情況，IPython提供了一種使用通配符`*`來匹配內容的方法。例如，我们可以使用它列出任何末尾为`Warning`的对象：

```python
In [10]: *Warning?
BytesWarning                  RuntimeWarning
DeprecationWarning            SyntaxWarning
FutureWarning                 UnicodeWarning
ImportWarning                 UserWarning
PendingDeprecationWarning     Warning
ResourceWarning
```

> Notice that the ``*`` character matches any string, including the empty string. Similarly, suppose we are looking for a string method that contains the word ``find`` somewhere in its name.
We can search for it this way:

這裡的`*`號能匹配任何字符串，包括空字符串。類似的，如果我們希望找到所有名稱中含有`find`字符串的對象內容，我們可以這樣做：

```python
In [10]: str.*find*?
str.find
str.rfind
```
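Outside IPython, a rough equivalent of these wildcard searches can be sketched with the standard library. The helper `wildcard_names` below is a hypothetical name introduced for illustration; it is not how IPython implements `*` matching, only a plain-Python approximation that filters `dir()` output with shell-style patterns.

```python
import builtins
import fnmatch


def wildcard_names(obj, pattern):
    """Return the attribute names of obj that match a shell-style pattern."""
    # fnmatchcase keeps the match case-sensitive on every platform.
    return sorted(name for name in dir(obj) if fnmatch.fnmatchcase(name, pattern))


# Roughly what `*Warning?` lists: built-in names ending in "Warning".
warning_names = wildcard_names(builtins, "*Warning")

# Roughly what `str.*find*?` lists: string methods containing "find".
find_methods = wildcard_names(str, "*find*")

print(warning_names)
print(find_methods)
```

Because `*` also matches the empty string, the bare name `Warning` appears in the first result alongside names such as `UserWarning`.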


- Python For Data Analysis 2nd

```
In []: an_apple = 27
In []: an_example = 42
In []: an <TAB>
```



```
In []: b = [1, 2, 3]
In []: b. <TAB>
```


```
In []: import datetime
In []: datetime.<TAB>
```
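The tab-completion examples above only work interactively, but the idea can be approximated in a script with the standard library's `rlcompleter` module. This is a sketch under stated assumptions: `rlcompleter` is the completer used by the plain Python REPL, not IPython's own completer, and the helper name `completions` is hypothetical.

```python
import rlcompleter


def completions(text, namespace):
    """Collect every completion rlcompleter offers for text in namespace."""
    completer = rlcompleter.Completer(namespace)
    matches = []
    state = 0
    while True:
        # complete() returns one candidate per state and None when exhausted.
        match = completer.complete(text, state)
        if match is None:
            break
        matches.append(match)
        state += 1
    return matches


# Completing a variable-name prefix, like typing `an<TAB>`.
name_matches = completions("an", {"an_apple": 27, "an_example": 42})

# Completing attributes of an object, like typing `b.<TAB>`.
attr_matches = completions("b.", {"b": [1, 2, 3]})

print(name_matches)
print(attr_matches)
```

The name search also picks up keywords and built-ins that share the prefix, which is why candidates beyond the two user-defined variables may show up in the first list.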


```
In []: import numpy as np
In []: np.*load*?
```

### Introspection

```python
In [29]: b = [1, 2, 3]
    ...: b?
Type:        list
String form: [1, 2, 3]
Length:      3
```

```python
In [30]: print?
Docstring:
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
```

```python
In [32]: def add_numbers(a, b):
    ...:     return a+b
    ...: add_numbers?
Signature: add_numbers(a, b)
Docstring: <no docstring>
File:      /var/folders/9q/486czkcn7lv5v0hwbt71twdc0000gn/T/ipykernel_8138/108992557.py
Type:      function
```

```python
In [33]: add_numbers??
Signature: add_numbers(a, b)
Docstring: <no docstring>
```
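Much of what `?` reports can also be gathered in ordinary Python with the standard `inspect` module. The sketch below is an approximation of the output format, not IPython's implementation; it reuses the `add_numbers` function from the cells above.

```python
import inspect


def add_numbers(a, b):
    return a + b


# The call signature, as an inspect.Signature object.
signature = inspect.signature(add_numbers)

# The cleaned-up docstring, or None when the function has none.
docstring = inspect.getdoc(add_numbers)

print(f"Signature: add_numbers{signature}")
print(f"Docstring: {docstring if docstring else '<no docstring>'}")
print(f"Type:      {type(add_numbers).__name__}")
```

For functions defined in a file, `inspect.getsource` can additionally retrieve the source text, which is similar in spirit to what `??` adds when the source is available.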