Data Analysis with Python——01 #75

Open
hsipeng opened this issue Mar 18, 2019 · 0 comments
billy-gov

Requirements

  • numpy
  • pandas
  • matplotlib
Time zone analysis
import json
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
%matplotlib inline 


# Load the records: one JSON object per line
path = 'datasets/example.txt'
with open(path) as f:
    records = [json.loads(line) for line in f]

# pandas DataFrame
frame = DataFrame(records)
frame

# Munge: data cleaning
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts2 = clean_tz.value_counts()
tz_counts2[:10]

# matplotlib: horizontal bar chart of the 10 most common time zones
tz_counts2[:10].plot(kind='barh', rot=0)
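
To check the cleaning step without the real dataset, here is a minimal sketch on a handful of made-up records. The record contents are assumptions for illustration only, and `Series.replace` is used here as an alternative to the boolean-mask assignment above.

```python
import pandas as pd

# Hypothetical records standing in for the parsed JSON lines
records = [
    {"tz": "America/New_York"},
    {"tz": ""},                    # empty string: time zone unknown
    {"tz": "America/New_York"},
    {"a": "Mozilla/5.0"},          # no 'tz' key at all: becomes NaN
    {"tz": "Europe/London"},
    {"tz": ""},
    {"tz": ""},
]
frame = pd.DataFrame(records)

# NaN (key absent) -> 'Missing', empty string -> 'Unknown'
clean_tz = frame["tz"].fillna("Missing").replace("", "Unknown")

# Count occurrences, most frequent first
tz_counts = clean_tz.value_counts()
print(tz_counts)
```

Keeping 'Missing' and 'Unknown' as separate labels makes it possible to tell records that lack the field from records where the field is present but empty.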
Agent analysis
# continued from above
results = Series([x.split()[0] for x in frame.a.dropna()])
results[:5]

results.value_counts()[:8]

# Drop rows that have no agent string
cframe = frame[frame.a.notnull()]

operating_system = np.where(cframe['a'].str.contains('Windows'),'Windows', 'Not Windows')
operating_system[:5]

by_tz_os = cframe.groupby(['tz',operating_system])
agg_counts = by_tz_os.size().unstack().fillna(0)
agg_counts[:10]

# Indexer: row positions that sort the time zones by total count, ascending
indexer = agg_counts.sum(1).argsort()

indexer[:10]

count_subset = agg_counts.take(indexer)[-10:]
count_subset
# plot
count_subset.plot(kind='barh', stacked=True)
# Normalize each row to sum to 1 so the relative shares are easier to compare
normed_subset = count_subset.div(count_subset.sum(1), axis=0)
normed_subset.plot(kind='barh', stacked=True)
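
The group, sort, and normalize steps can be traced on a tiny table. This is only a sketch: the time zone labels and agent strings below are invented, and the indexer is built from a NumPy array (`to_numpy().argsort()`), which is a small variation on the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'tz' and 'a' mirror the column names used above
frame = pd.DataFrame({
    "tz": ["A", "A", "B", "B", "B", "C"],
    "a": [
        "Mozilla/5.0 (Windows NT 6.1)",
        "Mozilla/5.0 (Macintosh)",
        "Mozilla/5.0 (Windows NT 5.1)",
        "Opera/9.8 (Windows NT 6.1)",
        "Mozilla/5.0 (X11; Linux)",
        None,                      # no agent string: dropped below
    ],
})

# Keep only rows that have an agent string
cframe = frame[frame["a"].notnull()]

# Label each row by whether the agent string mentions Windows
operating_system = np.where(
    cframe["a"].str.contains("Windows"), "Windows", "Not Windows"
)

# Count rows per (tz, OS) pair, then pivot the OS labels into columns
agg_counts = cframe.groupby(["tz", operating_system]).size().unstack().fillna(0)

# Positions that order the rows by total count, ascending
indexer = agg_counts.sum(axis=1).to_numpy().argsort()
count_subset = agg_counts.take(indexer)[-10:]

# Divide each row by its total so every row sums to 1
normed_subset = count_subset.div(count_subset.sum(axis=1), axis=0)
print(normed_subset)
```

Because the rows are sorted ascending, taking the last 10 gives the 10 largest time zones, and a horizontal bar chart then draws the largest one at the top.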