# Python training for data engineers
## 05. Data cleaning

### Goal
* Read the XML data from the previous notebook and add additional columns
* Read the JSON data and retrieve related information

### XML
In the first example we deal with the XML response of the first request, which we retrieved in the web crawling step. We use `lxml` to parse it.

Make sure `lxml` is installed; if it is not, run `conda install lxml` first.

In [1]:
# Load data from previous step
import pickle
# Use a context manager so the file is closed after loading
with open("xmlcontent_notebook_04.pickle", "rb") as f:
    xmlcontent = pickle.load(f)

In [2]:
import lxml.html

Next, construct the [XML tree](https://www.w3schools.com/xml/xml_tree.asp):

In [3]:
xmltree = lxml.html.fromstring(xmlcontent)
xmltree

<Element html at 0x1122209a8>

The HTML source for the result page is shown in the following snippet.
```html
<!DOCTYPE html>
<html>
<head>
  <title></title>
</head>
<body>
  ...  
  <div class="package-snippet">
    <h3 class="package-snippet__title">
      <a href="/project/learning/">Learning</a>
      <span class="package-snippet__version">1.0.0</span>
    </h3>
    <p class="package-snippet__description">Tyro</p>
  </div>
    ...
</body>
</html>
```

Using the `tag` attribute we can see the top element of the XML tree, which is indeed the root element we expect from the HTML above:

In [4]:
xmltree.tag

'html'

We want to get the links of all packages on the result page and add them to a list. Each link sits in the following structure:

```html
<div class="package-snippet"><h3><a href="link"></a></h3>
```

Make sure `cssselect` is installed in the environment; `lxml` needs it for the `cssselect()` method.

In [5]:
for link in xmltree.cssselect('div[class=\'package-snippet\'] h3 a'):
    print(link.get('href'))

/project/scikit-learn/
/project/scikit-build/
/project/scikit-ci-addons/
/project/scikit-hep/
/project/scikit-cycling/
/project/scikit-ci/
/project/scikit-learn-runnr/
/project/scikit-allel/
/project/scikit-chem/
/project/scikit-vis/
/project/scikit-bio/
/project/scikit-monaco/
/project/scikits-learn/
/project/scikit-optimize/
/project/scikit-ribo/
/project/scikit-dataaccess/
/project/ninja/
/project/scikit-ued/
/project/scikit-nano/
/project/scikit-neuralnetwork/


Let's save them to a list:

In [6]:
# Create an empty list
list_of_links = []
# Loop through all the links on the result page
for link in xmltree.cssselect('div[class=\'package-snippet\'] h3 a'):
    # Prepend the host name, drop the trailing '/' of the href and append '/#files'
    list_of_links.append('https://pypi.python.org' + link.get('href').rsplit('/',1)[0]+'/#files')

Show the first element of the list.

In [7]:
list_of_links[0]

'https://pypi.python.org/project/scikit-learn/#files'
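
The string concatenation above can also be written with the standard library. The following is a minimal sketch, assuming every `href` is a relative path that ends in `/` (as in the printed list); the names `BASE_URL` and `files_url` are made up for this illustration.

```python
from urllib.parse import urljoin

# Same host as in the cell above
BASE_URL = 'https://pypi.python.org'

def files_url(href):
    """Build the '#files' URL for a relative link such as '/project/scikit-learn/'."""
    # urljoin resolves the relative href against the base URL
    return urljoin(BASE_URL, href) + '#files'

print(files_url('/project/scikit-learn/'))
```

For the first link this gives the same URL as `list_of_links[0]`.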

#### Getting detailed information per package
Next we will loop over the links we retrieved in the previous step. For each detail page we will extract information about the files that are available for the package.

Example URL: https://pypi.python.org/pypi/scikit-learn/0.19.0/#files

```html
<table class="table table--downloads">
    <thead>
        <tr>
            <th class="table__filename">
                    Filename, size &amp; hash
                    <a href="https://pip.pypa.io/en/stable/reference/pip_install/#hash-checking-mode" class="tooltipped tooltipped-n" aria-label="what's this?" data-original-label="what's this?" target="_blank"><i class="fa fa-question-circle" aria-hidden="true"></i><span class="sr-only">SHA256 hash help</span></a></th>
            <th class="table__type">File type</th>
            <th class="table__version">Python version</th>
            <th class="table__upload-date">Upload date</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>
                <a href="https://files.pythonhosted.org/packages/de/d3/47c2c9842d61042f3c5f082f677dbe05899b077272105906a3249fe8c5da/scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl">
                      scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
                    </a>
                     (8.0 MB)
                    <a class="-js-copy-hash table__sha256-link tooltipped tooltipped-s" aria-label="Copy to clipboard" data-original-label="Copy to clipboard" data-clipboard-text="3775cca4ce3f94508bb7c8a6b113044b78c16b0a30a5c169ddeb6b9fe57a8a72"><i class="fa fa-copy" aria-hidden="true"></i><span class="sr-only">Copy SHA256 hash</span>
                      SHA256
                    </a></td>
            <td>
                    Wheel
                  </td>
            <td>
                    
                      cp27
                    
                  </td>
            <td>Oct 23, 2017</td>
        </tr>
    </tbody>
</table>
```

Retrieve the information for the first link by requesting the URL and converting the response to an XML tree.

In [8]:
import requests
response = requests.get(list_of_links[0])
xmltree = lxml.html.fromstring(response.content)

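Next we go through the table depicted in the HTML snippet above and extract the relevant information. The slice `[1:-1]` skips the first (header) row and the last row. Because `module_info` is overwritten on every iteration, it holds the values of the last processed row once the loop has finished.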


In [9]:
# Create dictionary
module_info = {}
for entry in xmltree.cssselect('table[class*=\'table--downloads\'] tr')[1:-1]:
    module_info['filename_size_hash'] = entry.cssselect('td')[0].text_content()
    module_info['file_type'] = entry.cssselect('td')[1].text_content()
    module_info['python_version'] = entry.cssselect('td')[2].text_content()
    module_info['uploaded_on'] = entry.cssselect('td')[3].text_content()
module_info

{'filename_size_hash': '\n                    \n                      scikit-learn-0.19.1.win-amd64-py2.7.exe\n                    \n                     (4.7 MB)\n                    \n                      \n                      Copy SHA256 hash\n                      SHA256\n                    \n                  ',
 'file_type': '\n                    Windows Installer\n                  ',
 'python_version': '\n                    \n                      2.7\n                    \n                  ',
 'uploaded_on': 'Oct 23, 2017'}

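As the `module_info` output shows, several values contain surrounding whitespace and newline characters. We strip the leading and trailing whitespace and remove the newline characters; the file type is also converted to lower case.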

In [10]:
module_info = {}
for entry in xmltree.cssselect('table[class*=\'table--downloads\'] tr')[1:-1]:
    module_info['filename_size_hash'] = entry.cssselect('td')[0].text_content().strip().replace('\n', '')
    module_info['file_type'] = entry.cssselect('td')[1].text_content().strip().replace('\n', '').lower()
    module_info['python_version'] = entry.cssselect('td')[2].text_content().strip().replace('\n', '')
    module_info['uploaded_on'] = entry.cssselect('td')[3].text_content()
module_info

{'filename_size_hash': 'scikit-learn-0.19.1.win-amd64-py2.7.exe                                         (4.7 MB)                                                                Copy SHA256 hash                      SHA256',
 'file_type': 'windows installer',
 'python_version': '2.7',
 'uploaded_on': 'Oct 23, 2017'}
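
The `filename_size_hash` value above still contains long runs of spaces, because `strip()` only touches the ends of the string and `replace('\n', '')` only removes the newlines. One possible way to collapse all internal whitespace is sketched below; the `raw` string is a shortened stand-in for the real cell text, not actual output.

```python
# Stand-in for the raw cell text (same shape as the output above, shorter indentation)
raw = ('\n      \n        scikit-learn-0.19.1.win-amd64-py2.7.exe\n      \n'
       '       (4.7 MB)\n      \n        \n        Copy SHA256 hash\n'
       '        SHA256\n      \n    ')

# split() without arguments splits on any run of whitespace and drops empty
# strings, so joining with a single space normalises the text
cleaned = ' '.join(raw.split())
print(cleaned)
```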

Let's put this in a function, so we can extract the data by simply calling the function with the URLs of the packages.

In [11]:
def get_info_from_link(url):
    # Request the detail page and parse it into a tree
    response = requests.get(url)
    xmltree = lxml.html.fromstring(response.content)
    module_info_list = []
    # Skip the first (header) row and the last row, as in the cells above
    for entry in xmltree.cssselect('table[class*=\'table--downloads\'] tr')[1:-1]:
        cells = entry.cssselect('td')
        module_info = {}
        # The package name is the second-to-last part of the URL
        module_info['package_name'] = url.rsplit('/',2)[1]
        module_info['filename_size_hash'] = cells[0].text_content().strip().replace('\n', '')
        module_info['file_type'] = cells[1].text_content().strip().replace('\n', '').lower()
        try:
            module_info['python_version'] = cells[2].text_content().strip().replace('\n', '')
        except IndexError:
            pass # No version found
        try:
            module_info['uploaded_on'] = cells[3].text_content()
        except IndexError:
            pass # No date found
        module_info_list.append(module_info)
    return module_info_list
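
The `try`/`except` blocks in the function guard against rows that have fewer cells than expected. The same idea can be isolated in a small helper, sketched here with plain strings standing in for the `<td>` elements; `cell_text` is a hypothetical name, and unlike the function above it returns `None` instead of leaving the key out.

```python
def cell_text(cells, index):
    """Return the cleaned text of cells[index], or None when the row has no such cell."""
    try:
        return cells[index].strip().replace('\n', '')
    except IndexError:
        # The row is shorter than expected
        return None

# Stand-in for the text of a row with only two cells
row = ['\n  Wheel\n ', ' cp27 ']
print(cell_text(row, 0))
print(cell_text(row, 3))
```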

Execute the function for the first link.

In [12]:
info = get_info_from_link(list_of_links[0])
info

[{'package_name': 'scikit-learn',
  'filename_size_hash': 'scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl                                         (8.0 MB)                                                                Copy SHA256 hash                      SHA256',
  'file_type': 'wheel',
  'python_version': 'cp27',
  'uploaded_on': 'Oct 23, 2017'},
 {'package_name': 'scikit-learn',
  'filename_size_hash': 'scikit_learn-0.19.1-cp27-cp27m-manylinux1_i686.whl                                         (11.4 MB)                                                                Copy SHA256 hash                      SHA256',
  'file_type': 'wheel',
  'python_version': 'cp27',
  'uploaded_on': 'Oct 23, 2017'},
 {'package_name': 'scikit-learn',
  'filename_size_hash': 'scikit_learn-0.19.1-cp27-cp27m-manylinux1_x86_64.whl                                         (12.2 MB)                                                  

From the info shown above, we can conclude that several files are available for the package, built for different platforms.

Let's extract data for the second link.

In [13]:
info_two = get_info_from_link(list_of_links[1])
info_two

[{'package_name': 'scikit-build',
  'filename_size_hash': 'scikit_build-0.6.1-py2.py3-none-any.whl                                         (52.9 kB)                                                                Copy SHA256 hash                      SHA256',
  'file_type': 'wheel',
  'python_version': 'py2.py3',
  'uploaded_on': 'Jun 8, 2017'}]

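We can now combine the two information blocks by putting them into a new list. Note that this creates a nested list (a list of lists) rather than one flat list of dictionaries, as the output shows.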

In [14]:
total_info = [info, info_two]
total_info

[[{'package_name': 'scikit-learn',
   'filename_size_hash': 'scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl                                         (8.0 MB)                                                                Copy SHA256 hash                      SHA256',
   'file_type': 'wheel',
   'python_version': 'cp27',
   'uploaded_on': 'Oct 23, 2017'},
  {'package_name': 'scikit-learn',
   'filename_size_hash': 'scikit_learn-0.19.1-cp27-cp27m-manylinux1_i686.whl                                         (11.4 MB)                                                                Copy SHA256 hash                      SHA256',
   'file_type': 'wheel',
   'python_version': 'cp27',
   'uploaded_on': 'Oct 23, 2017'},
  {'package_name': 'scikit-learn',
   'filename_size_hash': 'scikit_learn-0.19.1-cp27-cp27m-manylinux1_x86_64.whl                                         (12.2 MB)                                      
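
To make the difference between a nested and a flat list concrete, here is a small sketch with shortened stand-in dictionaries (the real entries have more keys). Concatenating with `+` or using `itertools.chain` gives one flat list.

```python
from itertools import chain

# Shortened stand-ins for the two results above
info = [{'package_name': 'scikit-learn', 'file_type': 'wheel'},
        {'package_name': 'scikit-learn', 'file_type': 'wheel'}]
info_two = [{'package_name': 'scikit-build', 'file_type': 'wheel'}]

nested = [info, info_two]                       # a list containing two lists
flat = info + info_two                          # one list of three dictionaries
flat_chain = list(chain.from_iterable(nested))  # flattens any number of lists

print(len(nested), len(flat), len(flat_chain))
```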

Let's extract the information for <u>all</u> the links to make a more interesting dataset.

In [15]:
# Initialize an empty list
all_info = []
for link in list_of_links:
    # Extract the data
    info_list = get_info_from_link(link)
    # Append the info to the big list
    all_info += info_list
all_info

[{'package_name': 'scikit-learn',
  'filename_size_hash': 'scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl                                         (8.0 MB)                                                                Copy SHA256 hash                      SHA256',
  'file_type': 'wheel',
  'python_version': 'cp27',
  'uploaded_on': 'Oct 23, 2017'},
 {'package_name': 'scikit-learn',
  'filename_size_hash': 'scikit_learn-0.19.1-cp27-cp27m-manylinux1_i686.whl                                         (11.4 MB)                                                                Copy SHA256 hash                      SHA256',
  'file_type': 'wheel',
  'python_version': 'cp27',
  'uploaded_on': 'Oct 23, 2017'},
 {'package_name': 'scikit-learn',
  'filename_size_hash': 'scikit_learn-0.19.1-cp27-cp27m-manylinux1_x86_64.whl                                         (12.2 MB)                                                  

### Data conversion
Import pandas so we can start using dataframes with our retrieved data.

In [16]:
import pandas as pd

In [17]:
# all_info is a list of dictionaries, which the DataFrame constructor accepts directly
xmldf = pd.DataFrame(all_info)
xmldf.head()

| | file_type | filename_size_hash | package_name | python_version | uploaded_on |
|---|---|---|---|---|---|
| 0 | wheel | scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_int... | scikit-learn | cp27 | Oct 23, 2017 |
| 1 | wheel | scikit_learn-0.19.1-cp27-cp27m-manylinux1_i686... | scikit-learn | cp27 | Oct 23, 2017 |
| 2 | wheel | scikit_learn-0.19.1-cp27-cp27m-manylinux1_x86_... | scikit-learn | cp27 | Oct 23, 2017 |
| 3 | wheel | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_i68... | scikit-learn | cp27 | Oct 23, 2017 |
| 4 | wheel | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_x86... | scikit-learn | cp27 | Oct 23, 2017 |


Show the datatypes of the dataframe.

In [18]:
xmldf.dtypes

file_type             object
filename_size_hash    object
package_name          object
python_version        object
uploaded_on           object
dtype: object

Convert the `file_type` column to a category.

In [19]:
xmldf['file_type'] = xmldf['file_type'].astype('category')

Convert the `uploaded_on` to a proper timestamp using `pd.to_datetime()`.

In [20]:
xmldf['uploaded_on'] = pd.to_datetime(xmldf['uploaded_on'])
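
`pd.to_datetime()` infers the date format here. To make the expected format explicit, the directives can be checked with the standard library first; the sketch below assumes English month abbreviations (the default locale) and uses the hypothetical helper name `parse_upload_date`. The same format string can be passed to `pd.to_datetime()` through its `format` argument.

```python
from datetime import datetime

# %b = abbreviated month name, %d = day of the month, %Y = four-digit year
UPLOAD_DATE_FORMAT = '%b %d, %Y'

def parse_upload_date(text):
    """Parse a date such as 'Oct 23, 2017'."""
    return datetime.strptime(text, UPLOAD_DATE_FORMAT)

print(parse_upload_date('Oct 23, 2017'))
print(parse_upload_date('Jun 8, 2017'))
```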

Split the `filename_size_hash` into filename and size:

In [21]:
# Use a raw string so the backslashes reach the regular expression engine unchanged
xmldf[['filename', 'size']] = xmldf['filename_size_hash'].str.extract(r'(\w.*?)\((.*?)\)', expand=True)

In [22]:
xmldf.head()

| | file_type | filename_size_hash | package_name | python_version | uploaded_on | filename | size |
|---|---|---|---|---|---|---|---|
| 0 | wheel | scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_int... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_int... | 8.0 MB |
| 1 | wheel | scikit_learn-0.19.1-cp27-cp27m-manylinux1_i686... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27m-manylinux1_i686... | 11.4 MB |
| 2 | wheel | scikit_learn-0.19.1-cp27-cp27m-manylinux1_x86_... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27m-manylinux1_x86_... | 12.2 MB |
| 3 | wheel | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_i68... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_i68... | 11.4 MB |
| 4 | wheel | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_x86... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_x86... | 12.2 MB |
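
To see what the two capture groups of the pattern pick up, the same expression can be tried on a single string with the `re` module. This is only a sketch; the `text` value imitates the cleaned `filename_size_hash` of the `scikit-build` entry with fewer spaces.

```python
import re

PATTERN = re.compile(r'(\w.*?)\((.*?)\)')

# Imitation of a cleaned filename_size_hash value
text = ('scikit_build-0.6.1-py2.py3-none-any.whl          (52.9 kB)'
        '          Copy SHA256 hash        SHA256')

match = PATTERN.search(text)
filename, size = match.group(1), match.group(2)

# Group 1 runs up to the opening parenthesis, so it keeps the trailing spaces
print(repr(filename))
print(repr(filename.rstrip()))
print(repr(size))
```

Since the first group keeps the spaces in front of the parenthesis, the `filename` column may need an extra `.str.strip()` if exact file names are required.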


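Split the `size` into a `size` and a `unit` by using a regular expression. Note that `(\d*)` only matches the digits directly in front of the whitespace, so for `8.0 MB` it captures `0` (the digit after the decimal point) instead of `8.0`; a pattern that allows a decimal point, such as `(\d+(?:\.\d+)?)`, would keep the full number.

In [23]:
xmldf[['size', 'unit']] = xmldf['size'].str.extract(r'(\d*)\s(\w*?)$', expand=True)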

In [24]:
xmldf.head()

| | file_type | filename_size_hash | package_name | python_version | uploaded_on | filename | size | unit |
|---|---|---|---|---|---|---|---|---|
| 0 | wheel | scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_int... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_int... | 0 | MB |
| 1 | wheel | scikit_learn-0.19.1-cp27-cp27m-manylinux1_i686... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27m-manylinux1_i686... | 4 | MB |
| 2 | wheel | scikit_learn-0.19.1-cp27-cp27m-manylinux1_x86_... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27m-manylinux1_x86_... | 2 | MB |
| 3 | wheel | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_i68... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_i68... | 4 | MB |
| 4 | wheel | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_x86... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_x86... | 2 | MB |
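
The `size` values `0`, `4` and `2` in this output are the digits after the decimal point of `8.0 MB`, `11.4 MB` and `12.2 MB`. The sketch below compares the pattern from the notebook with an adjusted one that also accepts a decimal point; the adjusted pattern is a suggestion, not part of the original notebook.

```python
import re

original = re.compile(r'(\d*)\s(\w*?)$')           # pattern used in the notebook
adjusted = re.compile(r'(\d+(?:\.\d+)?)\s(\w+)$')  # suggested: digits with optional fraction

for text in ['8.0 MB', '11.4 MB', '52.9 kB']:
    # search() returns the first match; groups() gives (size, unit)
    print(text, original.search(text).groups(), adjusted.search(text).groups())
```

With the adjusted pattern the captured sizes contain a decimal point, so the column would have to be converted with `astype('float')` instead of `astype('int')`.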


Fill the empty `size` values with 0 and convert the column to integers.

In [25]:
xmldf['size'] = xmldf['size'].fillna(0).astype('int')

Check the datatypes again.

In [26]:
xmldf.dtypes

file_type                   category
filename_size_hash            object
package_name                  object
python_version                object
uploaded_on           datetime64[ns]
filename                      object
size                           int64
unit                          object
dtype: object

In [27]:
xmldf.head()

| | file_type | filename_size_hash | package_name | python_version | uploaded_on | filename | size | unit |
|---|---|---|---|---|---|---|---|---|
| 0 | wheel | scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_int... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_int... | 0 | MB |
| 1 | wheel | scikit_learn-0.19.1-cp27-cp27m-manylinux1_i686... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27m-manylinux1_i686... | 4 | MB |
| 2 | wheel | scikit_learn-0.19.1-cp27-cp27m-manylinux1_x86_... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27m-manylinux1_x86_... | 2 | MB |
| 3 | wheel | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_i68... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_i68... | 4 | MB |
| 4 | wheel | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_x86... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_x86... | 2 | MB |


### Lambda function
Define a function to convert the size to bytes based on the unit.

In [28]:
def convert_to_bytes(size, unit):
    # Multiply by the factor that belongs to the unit
    if unit == 'kB':
        return size*1024
    if unit == 'MB':
        return size*1024*1024
    # Any other (or missing) unit: return the size unchanged
    return size
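
As an alternative to the `if`/`elif` chain, the factors can be kept in a dictionary, which makes it easy to add units later. This is a sketch with the hypothetical names `UNIT_FACTORS` and `convert_to_bytes_lookup`; it uses the same factors of 1024 as the function above.

```python
# Factor per unit; units that are not listed are left unchanged (factor 1)
UNIT_FACTORS = {'kB': 1024, 'MB': 1024 * 1024}

def convert_to_bytes_lookup(size, unit):
    """Convert size to bytes using a dictionary lookup for the unit."""
    return size * UNIT_FACTORS.get(unit, 1)

print(convert_to_bytes_lookup(4, 'MB'))
print(convert_to_bytes_lookup(3, 'kB'))
print(convert_to_bytes_lookup(5, None))
```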

Apply the function to every row by using `apply` with a `lambda` function and `axis=1`.

In [29]:
xmldf['size_in_bytes'] = xmldf.apply(lambda row: convert_to_bytes(row['size'], row['unit']), axis=1)
xmldf.head()

| | file_type | filename_size_hash | package_name | python_version | uploaded_on | filename | size | unit | size_in_bytes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | wheel | scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_int... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_int... | 0 | MB | 0 |
| 1 | wheel | scikit_learn-0.19.1-cp27-cp27m-manylinux1_i686... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27m-manylinux1_i686... | 4 | MB | 4194304 |
| 2 | wheel | scikit_learn-0.19.1-cp27-cp27m-manylinux1_x86_... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27m-manylinux1_x86_... | 2 | MB | 2097152 |
| 3 | wheel | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_i68... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_i68... | 4 | MB | 4194304 |
| 4 | wheel | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_x86... | scikit-learn | cp27 | 2017-10-23 | scikit_learn-0.19.1-cp27-cp27mu-manylinux1_x86... | 2 | MB | 2097152 |


In [30]:
xmldf.dtypes

file_type                   category
filename_size_hash            object
package_name                  object
python_version                object
uploaded_on           datetime64[ns]
filename                      object
size                           int64
unit                          object
size_in_bytes                  int64
dtype: object

In [31]:
xmldf.to_pickle('xml_dataframe_notebook_05.pickle')

### Handling the JSON data
Let's load the JSON data from the pickle file of the previous notebook:

In [32]:
# Use a context manager so the file is closed after loading
with open("jsoncontent_notebook_04.pickle", "rb") as f:
    jsoncontent = pickle.load(f)

In [33]:
jsoncontent

{'items': [{'has_synonyms': True,
   'is_moderator_only': False,
   'is_required': False,
   'count': 940527,
   'name': 'python'},
  {'has_synonyms': False,
   'is_moderator_only': False,
   'is_required': False,
   'count': 86864,
   'name': 'django'},
  {'has_synonyms': True,
   'is_moderator_only': False,
   'is_required': False,
   'count': 62823,
   'name': 'python-3.x'},
  {'has_synonyms': False,
   'is_moderator_only': False,
   'is_required': False,
   'count': 59204,
   'name': 'pandas'},
  {'has_synonyms': False,
   'is_moderator_only': False,
   'is_required': False,
   'count': 53225,
   'name': 'python-2.7'},
  {'has_synonyms': False,
   'is_moderator_only': False,
   'is_required': False,
   'count': 42574,
   'name': 'numpy'},
  {'has_synonyms': True,
   'is_moderator_only': False,
   'is_required': False,
   'count': 30503,
   'name': 'list'},
  {'has_synonyms': True,
   'is_moderator_only': False,
   'is_required': False,
   'count': 25829,
   'name': 'matplotlib'},
 

We need to convert the list of items to a dataframe:

In [34]:
# The items are a list of dictionaries, which the DataFrame constructor accepts directly
jsondf = pd.DataFrame(jsoncontent['items'])
jsondf.head()

| | count | has_synonyms | is_moderator_only | is_required | name |
|---|---|---|---|---|---|
| 0 | 940527 | True | False | False | python |
| 1 | 86864 | False | False | False | django |
| 2 | 62823 | True | False | False | python-3.x |
| 3 | 59204 | False | False | False | pandas |
| 4 | 53225 | False | False | False | python-2.7 |


We can observe that the datatypes of the dataframe are inferred from the JSON values (integers, booleans and strings), in contrast to the XML example from the first section of this notebook, where every column started out as `object`.

In [35]:
jsondf.dtypes

count                 int64
has_synonyms           bool
is_moderator_only      bool
is_required            bool
name                 object
dtype: object

Let's extract all the Python-related tags from the response.

In [36]:
DATA = {}
# Iterate over all the items
for item in jsoncontent['items']:
    # Check if python is absent in the DATA dictionary
    if 'python' not in DATA.keys():
        # Create a new empty dictionary for the python key in DATA
        DATA['python'] = {}
    # Add all non-python items to the python key
    if item['name'] != 'python':
        # Create a key in the python dictionary with the number of references
        DATA['python'][item['name']] = item['count']
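
The loop above can also be expressed as a dictionary comprehension. The sketch below uses a shortened, hand-written `items` list with only the `name` and `count` keys as a stand-in for `jsoncontent['items']`.

```python
# Shortened stand-in for jsoncontent['items']
items = [
    {'name': 'python', 'count': 940527},
    {'name': 'django', 'count': 86864},
    {'name': 'python-3.x', 'count': 62823},
]

# Map every tag except 'python' itself to its count, under the 'python' key
DATA = {'python': {item['name']: item['count']
                   for item in items if item['name'] != 'python'}}
print(DATA)
```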

Show the first tag.

In [37]:
for key in DATA['python'].keys():
    print(key)
    break

django


The above key is one of the many tags related to Python according to the Stack Overflow API. Let's use this key to extract all tags related to it.

In [38]:
API_URL = 'https://api.stackexchange.com/2.2/tags/%s/related?pagesize=100&site=stackoverflow' % key
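
Tag names such as `c++` contain characters with a special meaning in URLs. As a precaution, the tag can be percent-encoded before it is placed in the URL. This is a sketch using `urllib.parse.quote`; the names `URL_TEMPLATE` and `related_tags_url` are made up for this illustration.

```python
from urllib.parse import quote

URL_TEMPLATE = 'https://api.stackexchange.com/2.2/tags/%s/related?pagesize=100&site=stackoverflow'

def related_tags_url(tag):
    """Build the related-tags URL with a percent-encoded tag name."""
    # safe='' makes quote() escape every reserved character, including '/'
    return URL_TEMPLATE % quote(tag, safe='')

print(related_tags_url('django'))
print(related_tags_url('c++'))
```

Plain names such as `django` or `python-3.x` are left unchanged by `quote`, so the URL for the key above stays the same.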

Retrieve the related tags for the first key found above.

In [39]:
response = requests.get(API_URL)
data = response.json()

for item in data['items']:
    if key not in DATA.keys():
        DATA[key] = {}
    if item['name'] != key:
        DATA[key][item['name']] = item['count']

Current dataframe:

In [40]:
pd.DataFrame.from_dict(DATA, orient='index')

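| | django | python-3.x | pandas | python-2.7 | numpy | list | matplotlib | dictionary | regex | flask | ... | foreign-keys | django-authentication | django-haystack | django-cms | django-south | amazon-web-services | many-to-many | models | virtualenv | django-class-based-views |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| django | NaN | 3657 | NaN | 3659 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1052.0 | 1048.0 | 1036.0 | 990.0 | 984.0 | 939.0 | 919.0 | 887.0 | 870.0 | 870.0 |
| python | 86864.0 | 62823 | 59204.0 | 53225 | 42574.0 | 30503.0 | 25829.0 | 21991.0 | 20116.0 | 17262.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |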


As we can observe, we are building a two-dimensional matrix in which each cell holds the count for the combination of the row tag and the column tag.
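
How the nested dictionary turns into such a matrix can be pictured in plain Python. The sketch below uses a few of the counts shown above, sorts the column names for a predictable order (pandas may order them differently), and uses `None` where pandas would show `NaN`.

```python
# A few of the counts from the output above
DATA = {
    'python': {'django': 86864, 'python-3.x': 62823, 'python-2.7': 53225},
    'django': {'python-3.x': 3657, 'python-2.7': 3659},
}

# The columns are the union of all related tags
columns = sorted({name for related in DATA.values() for name in related})

# One row per key; missing combinations become None
matrix = {row: [related.get(col) for col in columns]
          for row, related in DATA.items()}

print(columns)
for row, values in matrix.items():
    print(row, values)
```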

Show the existing keys:

In [41]:
print(DATA.keys())

dict_keys(['python', 'django'])


Let's create a function for the code we just ran.

In [42]:
def add_items_from_key(key):
    API_URL = 'https://api.stackexchange.com/2.2/tags/%s/related?pagesize=100&site=stackoverflow' % key
    response = requests.get(API_URL)
    data = response.json()

    for item in data['items']:
        if key not in DATA.keys():
            DATA[key] = {}
        if item['name'] != key:
            DATA[key][item['name']] = item['count']

Now iterate over the keys for Python and retrieve all the items per key.

In [43]:
for key in DATA['python'].keys():
    if (key not in DATA.keys()):
        add_items_from_key(key)

Create the final dataframe.

In [44]:
df = pd.DataFrame.from_dict(DATA, orient='index')
df.head()

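| | django | python-3.x | pandas | python-2.7 | numpy | list | matplotlib | dictionary | regex | flask | ... | output | urllib3 | login | get | https | httprequest | http-post | http-headers | multipartform-data | grequests |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| algorithm | NaN | NaN | NaN | NaN | NaN | 870.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| arrays | NaN | NaN | NaN | NaN | 9475.0 | 4291.0 | NaN | 1931.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| beautifulsoup | 123.0 | 1411.0 | 168.0 | 1188.0 | NaN | 75.0 | NaN | 49.0 | 421.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| c++ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| class | NaN | 971.0 | NaN | 572.0 | NaN | 981.0 | NaN | 468.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |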


Finally, replace all NaNs with zero:

In [45]:
df = df.fillna(0)

In [46]:
df.head()

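| | django | python-3.x | pandas | python-2.7 | numpy | list | matplotlib | dictionary | regex | flask | ... | output | urllib3 | login | get | https | httprequest | http-post | http-headers | multipartform-data | grequests |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| algorithm | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 870.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| arrays | 0.0 | 0.0 | 0.0 | 0.0 | 9475.0 | 4291.0 | 0.0 | 1931.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| beautifulsoup | 123.0 | 1411.0 | 168.0 | 1188.0 | 0.0 | 75.0 | 0.0 | 49.0 | 421.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| c++ | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| class | 0.0 | 971.0 | 0.0 | 572.0 | 0.0 | 981.0 | 0.0 | 468.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |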


In [47]:
df.to_pickle("json_dataframe_notebook_05.pickle")