Rounding errors found in IDs #24

Closed
3 tasks done
kmcelwee opened this issue Oct 10, 2020 · 10 comments

@kmcelwee
Owner

kmcelwee commented Oct 10, 2020

The oEmbed process revealed IDs that had been rounded:

1265004698148495400
1265812305817862100
1266080125805834200

These IDs are in the dataset in fortune-100-blm-dataset. I believe I manually entered the values for Lowe's because of API limits, so that might be the cause.

  • Request oembed for all tweets with IDs ending in 00 to double-check that the problem is limited to these tweets.
  • Dig into the fortune-100-blm-dataset repo and double-check the scripts.
  • Pandas automatically reads the ID column as an integer. Verify that this doesn't cause issues.
@kmcelwee
Owner Author

df['ID'].astype(str).str.len().value_counts()

outputs:

19    83280
18    54160
11      117
17       87
10       38
16       11

This means a majority of the IDs (83,280 of 137,693) have 19 digits.

@kmcelwee
Owner Author

The maximum ID value as an integer is 1287171305138204672, which is greater than all of the rounded Lowe's values. This supports the hypothesis that only the Lowe's tweets are affected.

@kmcelwee
Owner Author

Looking only at IDs that end in 00 leaves us with 3158 IDs:

all_ids = [x for x in df['ID'].astype(str).tolist() if x[-2:] == '00']

Calling get_oembed(tweet_id) on each of these, the following IDs raised errors:

1139202878801715200
1063239357237248000
1192497953505792000
1266080125805834200
1265812305817862100
1265004698148495400
1176691432100249600
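
The check described above can be sketched roughly as follows. This is a minimal sketch, not the script actually used: `find_failing_ids` and its `fetch` parameter are hypothetical names, and `fetch` stands in for `get_oembed(tweet_id)`, which is assumed here to raise an exception when no embed can be produced.

```python
def find_failing_ids(ids, fetch):
    """Return the IDs ending in '00' for which fetch(tweet_id) raises.

    `fetch` is a stand-in for get_oembed(tweet_id); it is assumed to
    raise an exception when no embed can be produced.
    """
    # Keep only the suspicious IDs, compared as strings.
    suspects = [str(i) for i in ids if str(i).endswith('00')]
    failing = []
    for tweet_id in suspects:
        try:
            fetch(tweet_id)
        except Exception:
            # Any failure is recorded; the reason still has to be checked by hand.
            failing.append(tweet_id)
    return failing
```

A failure here only means the embed could not be fetched. A deleted tweet fails in the same way as a rounded ID, so each hit still needs a manual look.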

@kmcelwee
Owner Author

kmcelwee commented Oct 11, 2020

1139202878801715200
Nike
Thu Jun 13 16:08:26 +0000 2019

It doesn’t matter what you play. Nobody wins alone. #BeTrue #UntilWeAllWin

@caster800m @TheChrisMosier @ScoutBassett @KerronClement @MarkMcKenzie4_ @ EricKoston @S10bird @brittneyGriner @jordin_canada @jewellloyd https://t.co/veA9PtqwbW

✅ confirmed. This was deleted.

@kmcelwee
Owner Author

1063239357237248000,Exelon,Fri Nov 16 01:16:31 +0000 2018,,

RT @ Amartines: “HR does not solely own the responsibility for ensuring diversity. Leaders need to be accountable for the make up of their t…

Cannot scroll back far enough. Feed for Exelon stops in 2019. The original tweet exists though. Not exactly sure what happened here.

@kmcelwee
Owner Author

1192497953505792000,IBM,Thu Nov 07 17:44:02 +0000 2019,

What does a day without IBM look like?

Watch Techless, where people must complete seemingly simple tasks without using anything that was invented by IBM or could use our technology: https://t.co/GRvjE2fz6s https://t.co/tL5UnsfCbW

✅ confirmed. This was deleted.

@kmcelwee
Owner Author

1266080125805834200
1265812305817862100
1265004698148495400

✅ These are all the Lowe's tweets we already know about.

@kmcelwee
Owner Author

1176691432100249600,Facebook,Wed Sep 25 02:54:33 +0000 2019

RT @ boztank: See you tomorrow at #OC6

https://t.co/oFTviQaIyr

✅ Looks like the original tweet was deleted

@kmcelwee
Owner Author

It seems that in the raw data pull (fortune-100-blm-dataset/data/fortune-100-json/Lowes.json), the id did not match id_str. A test has been added to test.py to check for this.
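
A check of this kind might look roughly like the sketch below. It is not the test that was added to test.py: `find_id_mismatches` is a hypothetical helper, and the sample records are made up for illustration. Only the field names `id` and `id_str` come from the raw data described above.

```python
import json


def find_id_mismatches(tweets):
    """Return the id_str of every record whose numeric id disagrees with it."""
    return [t['id_str'] for t in tweets if str(t['id']) != t['id_str']]


# Made-up sample records, not taken from Lowes.json.
sample = json.loads('''[
    {"id": 1000000000000000001, "id_str": "1000000000000000001"},
    {"id": 1000000000000000100, "id_str": "1000000000000000123"}
]''')

mismatches = find_id_mismatches(sample)
print(mismatches)  # prints ['1000000000000000123']
```

When the two fields disagree, id_str is the one to trust, since a string cannot be silently rounded the way a number can.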

@kmcelwee
Owner Author

Pandas supports 64-bit integers by default, and Twitter indicates that its IDs are 64-bit integers as well. I still can't figure out how that error crept in, but it should be all set now.
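
The 64-bit point can be illustrated with a short sketch in plain Python (`survives_float64` is a hypothetical helper name). It shows that the largest ID in the dataset fits in a signed 64-bit integer, whereas a 64-bit float cannot represent every integer above 2**53. That is one way digits can get lost, though whether it is what happened to the Lowe's IDs is not established in this thread.

```python
INT64_MAX = 2**63 - 1

# Largest ID reported earlier in this thread.
max_id = 1287171305138204672
fits_in_int64 = max_id <= INT64_MAX


def survives_float64(n):
    """True if the integer n is unchanged by a round trip through a float."""
    return int(float(n)) == n


print(fits_in_int64)                # True
print(survives_float64(2**53))      # True
print(survives_float64(2**53 + 1))  # False
```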
